LOW-BIT RATE SPEECH ENCODERS
BASED ON LINE-SPECTRUM FREQUENCIES (LSFs)


INTRODUCTION 

Voice data in military communications are increasingly being encoded by digital rather than analog
waveforms because digital encryption for security reasons is both easier and less vulnerable to
unauthorized decryption. The data rate of unprocessed speech is 64,000 bits per second (b/s), but this rate
is reduced to 2400 b/s for transmission over narrowband channels (having a bandwidth of approximately
3 kHz). The recently developed narrowband linear predictive coder (LPC) operating at 2400
b/s is such an example, and it is expected to be deployed extensively in the near future. The 2400-b/s
LPC has been standardized within government agencies (Federal Standard 1015 or MIL-STD-188-113),
and it has been adopted by NATO allied forces (STANAG 4198).

Recently,  however,  data  rates  above  and  below  2400  b/s  have  been  gaining  considerable  interest; 
in  particular,  very-low-data-rate  (VLDR)  (i.e.,  800  b/s  or  less)  and  4800  b/s.  Future  research  and 
development  (R&D)  efforts  should  be  expanded  in  these  areas  according  to  a  statement  made  by  Mr. 
Donald  C.  Latham,  Deputy  Under  Secretary  of  Defense  (Communications,  Command,  Control,  and 
Intelligence)  which  was  related  by  the  chairman  of  the  Department  of  Defense  (DoD)  Digital  Voice 
Processor  Consortium  on  January  31,  1984. 

There  is  a  real  need  for  voice  processors  operating  at  these  data  rates.  The  VLDR  voice  processor 
is for specialized military voice communication systems where a reliable connectivity is critically
dependent on the reduced speech information rate. The implementation of an 800-b/s voice processor is not
an  easy  task  because  the  voice  processor  eliminates  approximately  99%  of  the  bit  rate  associated  with 
the  original  speech.  The  intelligibility  scores  measured  by  the  diagnostic  rhyme  test  (DRT)  have  been 
in  the  low  80s  from  all  previous  experimental  800-b/s  voice  processors.  Improvement  in  intelligibility 
is  definitely  desired. 

On the other hand, an improved voice processor operating at 4800 b/s is also useful. It is well
known that the 2400-b/s LPC does not reproduce indistinct or rapidly spoken speech. It is also
somewhat biased against female voices (DRT differential of approximately 5.5 points). In fact, the 2400-b/s
LPC  is  a  difficult  device  to  talk  over,  according  to  recent  communicability  tests  conducted  at  the  Naval 
Research  Laboratory  (NRL)  [1].  As  illustrated  in  Fig.  1,  communicability  score  for  the  2400-b/s  LPC 
lags  significantly  behind  that  of  a  9600-b/s  LPC.  The  use  of  a  9600-b/s  voice  processor,  however,  may 
not  be  an  ideal  solution  for  all  narrowband  users  because  some  narrowband  channels  cannot  support  a 
data  rate  of  9600  b/s.  A  preferred  solution  would  be  to  use  4800  b/s,  a  data  rate  not  far  above  2400 
b/s. It is interesting to recall that early experimental 2400-b/s LPCs developed in the mid-1970s
routinely incorporated the 4800-b/s mode (and in some cases the 3600-b/s mode as well) to provide
improved  speech  quality  at  the  expense  of  a  slightly  higher  data  rate.  For  one  reason  or  another,  such 
an  option  was  gradually  dropped  in  the  recently  developed  2400-b/s  LPC.  This  was  an  unfortunate 
oversight  in  retrospect. 

During the development of these voice processing algorithms, we have made considerable effort
to minimize computational complexity in order to make real-time operation feasible using present-day
hardware. Listening tests alone cannot evaluate fully the actual usability of a voice processor in a
two-way communication link. Only conversational tests (which require two real-time processors) allow the
users to let each other know when communication has failed. The voice processing algorithms
described in this report are well within real-time implementation using existing Navy-owned special
signal processors, but real-time simulations have not yet been performed.

Fig. 1 - Two conversational test scores for various voice processors. The lower the score, the
more effort is needed to communicate. This figure implies that the 2400-b/s LPC is not an easy
device to talk over. The telephone score is in the lower 90s. In the NRL Communicability Test,
the subject's task is an abbreviated version of the pencil-and-paper game "battleship." In the British
Free Conversation Test, subjects are given some task such as the comparison of pairs of
photographs that induces the participants to talk for about ten minutes.

The Navy relies heavily on narrowband channels for voice communication. Because this capability
is vital to the Navy, the Naval Research Laboratory has been performing R&D on narrowband voice
processors. In 1973, NRL developed one of the first narrowband LPCs capable of running in real time.
Since 1975, NRL has investigated and sponsored several different speech encoding techniques designed
to operate at 600 to 800 b/s. In 1981, NRL and Motorola produced a miniaturized 2400-b/s LPC that
is only slightly larger than a standard desk telephone. During 1982 and 1983, NRL conducted studies
to improve the 2400-b/s LPC without altering the interoperability requirements specified by Federal
Standard 1015. Now, a study has been made to implement both 800- and 4800-b/s voice processors.
This report is a result of continuing efforts by NRL to make narrowband voice processors more
acceptable to general users with diversified operating conditions.

BACKGROUND 

Since Dudley invented what is known as the vocoder (derived from voice coder [2]) some 40 years
ago, the fundamental principle employed by the narrowband voice encoder has not significantly
changed. As illustrated in Fig. 2, the speech synthesizer consists of a filter representing the vocal tract
and an excitation source that drives the filter. The excitation signal is usually one of two signals:
random noise for the production of unvoiced sounds (i.e., consonants), or a pulse train for the production
of voiced sounds (i.e., vowels). Such an excitation signal is still in use in current narrowband LPCs
although there have been several different filters implemented.

Fig. 2 - Simplified electrical analogue of speech generation by a narrowband voice processor.
To generate continuous speech, the parameters (pitch period, voicing states, filter weights, and
loudness) are updated 40 to 50 times per second.
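The source-filter arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the synthesizer of any fielded processor: the 8-kHz sampling rate, the function names `excitation` and `all_pole_filter`, the pitch period of 80 samples, and the single-resonance filter are all hypothetical choices made to keep the example small.

```python
import numpy as np

FS = 8000  # assumed sampling rate in Hz (not stated at this point in the text)

def excitation(num_samples, voiced, pitch_period=80, seed=0):
    """Pulse train for voiced frames, random noise for unvoiced frames."""
    if voiced:
        e = np.zeros(num_samples)
        e[::pitch_period] = 1.0          # one pulse per pitch period
        return e
    return np.random.default_rng(seed).standard_normal(num_samples)

def all_pole_filter(e, a):
    """y[t] = e[t] + sum_i a[i-1] * y[t-i], i.e. a feedback (all-pole) filter."""
    y = np.zeros(len(e))
    for t in range(len(e)):
        acc = e[t]
        for i, ai in enumerate(a, start=1):
            if t - i >= 0:
                acc += ai * y[t - i]
        y[t] = acc
    return y

# Hypothetical vocal-tract filter: one resonance near 500 Hz, pole radius 0.95.
r, theta = 0.95, 2 * np.pi * 500 / FS
a = [2 * r * np.cos(theta), -r * r]

voiced = all_pole_filter(excitation(400, voiced=True), a)
unvoiced = all_pole_filter(excitation(400, voiced=False), a)
```

In a real vocoder the filter weights, voicing decision, pitch period, and loudness would be refreshed frame by frame; the sketch holds them fixed.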

The  original  device  by  Dudley,  the  channel  vocoder,  used  ten  contiguous  narrowband-bandpass 
filters  [2].  Subsequent  channel  vocoders  most  commonly  used  16  channels,  but  some  had  as  many  as 
19  channels.  Well-designed  channel  vocoders  achieved  DRT  scores  in  the  upper  80s  at  2400  b/s.  The 
channel  vocoder  resistance  to  transmission-bit  errors  is  remarkable  because  an  error  in  a  single  filter 
parameter  results  in  synthesized  speech  distortion  only  in  that  particular  frequency  band.  According  to 
previous  tests,  processed  speech  retains  an  acceptable  speech  intelligibility  (i.e.,  DRT  score  of  81)  even 
with 5% random-transmission-bit errors. Consonant sounds are rather realistic for certain channel
vocoders  because  the  passband  extends  up  to  7  kHz  or  more.  But,  a  major  limitation  of  the  channel 
vocoder  in  general  is  its  unnatural  vowel  sounds  somewhat  akin  to  that  of  speech  sounds  propagated 
through  a  long  hollow  pipe. 

Speech  quality  is  greatly  improved  if  the  amplitude  response  of  the  filter  shown  in  Fig.  2  models 
directly the speech-spectral envelope. Such a vocoder has been devised and is known as the
spectral-envelope-estimation vocoder [3]. In this vocoder, speech samples are converted to the spectral
envelope using a 256-point fast Fourier transform (FFT), and the resulting log-spectral envelope is
down-sampled by a factor of 3. Thus, the speech-spectral envelope is represented by approximately
128/3, or about 42, points.

A significant amount of data-rate reduction is achieved by the formant vocoder which transmits
speech-spectral  characteristics  only  near  the  resonant  frequencies.  The  vocal-tract  filter  may  be 
represented  by  three  or  more  finite-Q  resonators  connected  either  in  series  or  parallel.  If  the  resonators 
are  connected  in  series,  relative  formant  amplitudes  are  not  required  [4,5].  This  is  similar  to  an  all-pole 
filter  that  does  not  require  information  about  the  individual  pole  residue  as  is  discussed  below.  If  the 
resonators are connected in parallel, relative formant amplitudes are required to adjust channel gains
[6]. In either configuration, formant bandwidth may be transmitted, or it may be assigned at the
receiver  causing  degraded  speech  but  achieving  a  lower  bit  rate. 

Since the early 1970s, an all-pole filter has been used more extensively as the vocal-tract filter in
narrowband voice processors [7]. The notable example is the current narrowband LPC. One advantage
of using an all-pole filter is that its phase response is a continuous function of frequency (unlike that of
a contiguous-filter bank employed by the channel vocoder). In addition, the amplitude response of an
all-pole filter is rather faithful to the speech-spectral envelope. As an added advantage, both the filter
coefficient generation and speech synthesis are computationally efficient. Unfortunately, a minor
drawback is a lack of robustness under a bit-error condition. An error in any one filter coefficient causes
speech spectral distortions over the entire passband, not only in resonant frequencies but also in
resonant amplitude. Note that resonant amplitudes of an all-pole filter are specified implicitly by
resonant frequencies. A 5% random-bit error can reduce a DRT score by as much as 22 points, in
contrast to only 7 points for the channel vocoder. Thus, some form of forward-error protection is desirable
in LPC, as provided in the government standardized narrowband LPC.

We  can  represent  an  all-pole  filter  in  many  ways.  The  most  direct  representation  is  a  positive 
feedback  loop  with  a  transversal  filter  in  the  feedback  loop  (see  the  appendix).  In  this  representation, 
the  filter  coefficients  are  prediction  coefficients.  Prediction  coefficients  are  the  weighting  factors 
appearing  in  the  basic  prediction  equation;  namely,  a  time  sample  is  expressed  as  a  weighted  sum  of 
past  samples.  These  coefficients  are  actually  not  well  suited  for  transmission  because  a  bit  error  in  any 
one  coefficient  can  cause  the  synthesis  filter  to  become  unstable.  Although  there  are  many  ways  of 
checking  the  filter  stability  at  the  receiver  (i.e.,  Hurwitz-Routh  criterion  and  Schur-Cohn  criterion  are 
among  the  best  known  [8]),  they  all  need  a  fair  amount  of  computation.  In  retrospect,  one  of  the  most 
significant  factors  contributing  to  a  successful  implementation  of  the  current  narrowband  LPC  was  the 
choice  in  the  early  1970s  to  transmit  reflection  coefficients  rather  than  prediction  coefficients. 
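The sensitivity of prediction coefficients can be illustrated with a small Python sketch. It does not implement the Hurwitz-Routh or Schur-Cohn criteria named above; as a simpler stand-in it finds the poles numerically and checks that they lie inside the unit circle. The function name `is_stable` and the second-order coefficient values are hypothetical.

```python
import numpy as np

def is_stable(a):
    """True if 1/A(z), with A(z) = 1 - sum_i a[i-1] z^-i, has all poles inside the unit circle."""
    # Poles are the roots of z^n - a_1 z^(n-1) - ... - a_n.
    poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return bool(np.all(np.abs(np.roots(poly)) < 1.0))

a_good = [1.6, -0.8]   # poles at 0.8 +/- 0.4j, magnitude about 0.894
a_bad = [1.6, -1.1]    # same a_1, but a_2 disturbed: pole magnitude about 1.049

print(is_stable(a_good), is_stable(a_bad))   # prints "True False"
```

A change in a single coefficient is enough to push the pole pair outside the unit circle, which is the failure mode that a transmission-bit error could trigger.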

Reflection  coefficients  are  transformed  prediction  coefficients  which  are  also  the  coefficients  of  an 
all-pole  filter  represented  by  a  cascaded-lattice  filter  (see  the  appendix).  The  advantage  of  transmitting 
reflection coefficients is that the stability of the synthesis filter is assured if the magnitude of each
coefficient is between plus and minus one. In fact, the synthesis filter of the narrowband LPC never
becomes  unstable  because  the  coefficient  coding/decoding  tables  do  not  yield  coefficients  which  can 
give  rise  to  filter  instability.  The  weakness  of  reflection  coefficients  is  that  a  change  in  one  coefficient 
causes  speech  spectral  changes  in  the  entire  passband. 
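The stability property can be checked numerically with a short Python sketch. It assumes the order-update recursion given later in this report as Eq. (2), read here as A_m(z) = A_{m-1}(z) - k_m z^-m A_{m-1}(1/z); the function names and the reflection-coefficient values are hypothetical.

```python
import numpy as np

def reflection_to_analysis_poly(k):
    """Coefficients [1, -a_1, ..., -a_n] of A_n(z), built one order at a time."""
    c = np.array([1.0])
    for km in k:
        c = np.append(c, 0.0)        # pad A_{m-1}(z) to degree m
        c = c - km * c[::-1]         # subtract k_m z^-m A_{m-1}(1/z)
    return c

def max_pole_radius(c):
    """Largest pole magnitude of the synthesis filter 1/A_n(z)."""
    return float(np.max(np.abs(np.roots(c))))

inside = reflection_to_analysis_poly([0.9, -0.8, 0.5, -0.3])    # every |k| < 1
outside = reflection_to_analysis_poly([0.9, -0.8, 0.5, -1.2])   # last |k| > 1
```

With every magnitude below one, all poles stay inside the unit circle. In the second case the product of the pole magnitudes equals the magnitude of the last coefficient (1.2), so at least one pole must lie outside.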

To  overcome  this  weakness,  we  present  another  representation  of  an  all-pole  filter  for  use  in  both 
the  800-  and  4800-b/s  voice  processors.  In  this  all-pole  filter  representation,  the  parameters  are  line- 
spectrum  frequencies  (LSFs)  (i.e.,  resonant  frequencies  with  an  infinite-Q  or  discrete  frequencies).  It 
is  worthwhile  to  note  that  the  vocal-tract-filter  parameters  are  once  again  frequency-domain  parameters 
as  they  are  in  the  channel  vocoder  and  formant  vocoder.  As  depicted  in  Fig.  3,  LSFs  may  be  obtained 
from  prediction  coefficients  via  a  transformation;  similarly,  reflection  coefficients  may  be  obtained  from 
prediction  coefficients  via  a  transformation.  An  advantage  of  using  LSFs  is  that  the  error  in  one  LSF 
affects  the  synthesized  spectrum  near  that  frequency.  Another  advantage  for  using  LSFs  is  that  they 
may  be  more  readily  quantized  in  accordance  with  properties  of  auditory  perception  to  save  bits  (i.e., 
coarser  quantization  of  the  higher  frequency  spectral  components). 

Prediction  coefficients  may  be  transformed  into  LSFs  through  the  decomposition  of  the  pulse 
response  of  the  LPC-analysis  filter  into  even-  and  odd-time  sequences  (Fig.  4).  This  decomposition  is 
reversible because the original pulse response can be obtained by half the sum of the even- and
odd-time sequences. As will be shown, the even-time sequence has roots along the unit circle in the
complex plane. Thus the even-time sequence may be represented by LSFs. Likewise, the odd-time
sequence  has  roots  along  the  unit  circle  in  the  complex  plane.  Hence,  the  odd-time  sequence  may  also 
be  represented  by  LSFs. 

Line-spectrum representation of prediction coefficients was first made public by F. Itakura in 1975
at the 89th meeting of the Acoustical Society of America [9]. Since then, applications of LSFs have been
pursued mainly in Japan. It is significant to note that among 17 different speech synthesis chips which
Japan produced during the past several years, one of them (ECL-1565) uses LSFs as filter parameters
[10]. In the United States, however, very little work has been done in this area. Only recently, a
conference paper [11] and an article in a periodical [12] have been published related to this topic.

[Diagram: Prediction Coefficients are linked to Line-Spectrum Frequencies (LSFs) by Eqs. (24)
through (32), Eqs. (33) through (35), and Eqs. (36) & (37); Prediction Coefficients are linked to
Reflection Coefficients by Eq. (23) of Appendix and Eq. (11) of Appendix.]

Fig. 3 - Parameters related to linear predictive analysis. When a time sample is expressed as a linear
combination of past samples, the weighting factors are prediction coefficients (see the appendix).
Prediction coefficients are never transmitted as speech parameters because bit errors can cause the
synthesis filter to become unstable. Reflection coefficients are filter parameters which may be
obtained directly from speech samples or from prediction coefficients (see the appendix). The
synthesis filter is stable as long as the magnitude of each reflection coefficient is confined between
plus and minus one. Currently, all LPC-based voice processors transmit reflection coefficients. LSFs
and prediction coefficients are mutually transformable. Like reflection coefficients, LSFs have their
own unique properties to consider for optimum quantization.

(a) Pulse response of tenth-order LPC analysis filter in which $a_1$ through $a_{10}$ are prediction
coefficients

(b) Time-shifted and time-reversed waveform of (a)

(c) Sum of waveforms: (a) plus (b). The time sequence is even symmetric with respect to its midpoint.
All roots of this sequence are along the unit circle of the complex plane, with a real root at -1. Note
that the first and last samples are both 1.

(d) Difference of waveforms: (a) minus (b). The time sequence is odd symmetric with respect to its
midpoint. All roots of this sequence are also along the unit circle of the complex plane, with a real
root at 1.

Fig. 4 - The response of the LPC-analysis filter decomposed into even- and odd-time sequences. This
decomposition is the basis of the transformation from prediction coefficients to LSFs. The amplitude
spectra of these waveforms are shown in Figs. 7 and 8.

Our  report  also  explores  the  application  of  LSFs.  We  emphasize  the  implementation  of  an  800- 
b/s  pitch-excited  LPC  and  a  4800-b/s  nonpitch-excited  LPC.  Synthesized  speech  derived  from  LSFs 
using  these  two  implementations  was  evaluated  for  the  first  time  by  formalized  test  procedures  using 
the DRT. In addition, this report contains the derivation of all the necessary equations, results
obtained  from  an  investigation  of  the  parameter  sensitivities  on  spectral  distortions,  and  outcome  of  a 
perceptual  experiment  using  sounds  generated  from  LSFs. 

LSFs  AS  FILTER  PARAMETERS 

Currently,  the  most  frequently  used  parameters  for  the  LPC-analysis  and  synthesis  filters  are 
prediction coefficients or reflection coefficients. This section derives LSFs which are equivalent filter
parameters. 

LPC-Analysis  Filter 

The LPC-analysis filter transforms speech samples into prediction residual samples. The most
commonly used filter parameters have been either prediction coefficients or reflection coefficients (see the
appendix). A functionally equivalent LPC-analysis filter may be constructed from the sum of two
filters with even and odd symmetries. The basic principle is similar to the decomposition of an
arbitrary function into a sum of even and odd functions [13].

We can express the transfer function of the nth-order LPC-analysis filter as

$$A_n(z) = 1 - a_1 z^{-1} - a_2 z^{-2} - \dots - a_n z^{-n}, \tag{1}$$

where $a_i$ is the ith prediction coefficient of the nth-order predictor (i.e., $a_i$ is a simplified notation of
$a_{i|n}$ used in the appendix), and $z^{-1}$ is a one-sample delay operator. The recursive relationship of
$A_{n+1}(z)$ in terms of $A_n(z)$, as noted in the appendix, is

$$A_{n+1}(z) = A_n(z) - k_{n+1} z^{-(n+1)} A_n(z^{-1}), \tag{2}$$

where $k_{n+1}$ is the (n + 1)th reflection coefficient which equals $a_{n+1}$ of the (n + 1)th-order predictor.

Let $P_{n+1}(z)$ be $A_{n+1}(z)$ with $k_{n+1} = 1$ (i.e., an open-end termination). Thus,

$$P_{n+1}(z) = A_n(z) - z^{-(n+1)} A_n(z^{-1}). \tag{3}$$

Likewise, let $Q_{n+1}(z)$ be $A_{n+1}(z)$ with $k_{n+1} = -1$ (i.e., a closed-end termination). Thus,

$$Q_{n+1}(z) = A_n(z) + z^{-(n+1)} A_n(z^{-1}). \tag{4}$$

From Eqs. (3) and (4), we express $A_n(z)$ as

$$A_n(z) = \tfrac{1}{2}\left[P_{n+1}(z) + Q_{n+1}(z)\right]. \tag{5}$$

Equation (5) is an alternative form of the LPC-analysis filter. The nature of $P_{n+1}(z)$ and $Q_{n+1}(z)$ is yet
to be discussed.

$P_{n+1}(z)$ of Eq. (3) may be considered as the pulse response of a difference filter which takes the
difference between the nth-order LPC-analysis filter output and its conjugate filter output. Note that
$P_{n+1}(z)$ is odd-symmetric with respect to its midpoint (Fig. 4). On the other hand, $Q_{n+1}(z)$ of Eq. (4)
may be considered as the pulse response of a sum filter which takes the sum of the $A_n(z)$ output and its
conjugate filter output. Note that $Q_{n+1}(z)$ is even-symmetric with respect to its midpoint (Fig. 4).
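The decomposition can be verified numerically. The Python sketch below builds the difference and sum filters from the coefficients of an analysis filter and checks the symmetries and the location of the roots. The test filter (five pole pairs of radius 0.9 at an assumed 8-kHz sampling rate) and the function names are hypothetical; any analysis filter whose synthesis filter is stable should behave the same way.

```python
import numpy as np

def analysis_poly_from_poles(radii, angles):
    """A_n(z) coefficients [1, -a_1, ..., -a_n] from conjugate pole pairs (illustrative input)."""
    c = np.array([1.0])
    for r, th in zip(radii, angles):
        c = np.convolve(c, [1.0, -2.0 * r * np.cos(th), r * r])
    return c

def difference_and_sum(c):
    """P_{n+1}(z) and Q_{n+1}(z) as coefficient arrays in powers of z^-1."""
    ext = np.append(c, 0.0)       # A_n(z), padded to degree n + 1
    rev = ext[::-1]               # z^-(n+1) A_n(1/z)
    return ext - rev, ext + rev

angles = 2 * np.pi * np.array([400, 1200, 2000, 2800, 3600]) / 8000
A = analysis_poly_from_poles([0.9] * 5, angles)
P, Q = difference_and_sum(A)

print(np.round(np.abs(np.roots(P)), 6))   # root magnitudes of the difference filter
print(np.round(np.abs(np.roots(Q)), 6))   # root magnitudes of the sum filter
```

Half the sum of the two arrays reproduces the padded coefficients of the analysis filter, which is the coefficient-level statement of Eq. (5).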


This kind of filter composition need not be associated with the speech analysis ...
filter. The decomposition expressed by Eq. (5) holds for any arbitrary finite ...
filter. Similar decomposition has been exploited in the stability study of linear ...

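To show that all roots of $P_{n+1}(z)$ lie along the unit circle in the z-plane, $A_n(z)$ expressed
by Eq. (1) is substituted for $A_n(z)$ in Eq. (3). Thus,

$$P_{n+1}(z) = 1 - (a_1 - a_n) z^{-1} - (a_2 - a_{n-1}) z^{-2} - \dots + (a_2 - a_{n-1}) z^{-(n-1)} + (a_1 - a_n) z^{-n} - z^{-(n+1)}.$$

Since $P_{n+1}(z)$ has an odd symmetry with respect to its center, $z = 1$ is a root. Factoring out
$(1 - z^{-1})$ and grouping the remaining roots into second-order sections gives

$$P_{n+1}(z) = (1 - z^{-1}) \prod_{k=1}^{n/2} \left(1 + d_k z^{-1} + z^{-2}\right), \tag{9}$$

where

$$d_k = -2\cos(\theta_k) = -2\cos(2\pi f_k t_s), \qquad 0 < \theta_k < \pi, \tag{10}$$

in which $f_k$ is the kth LSF in hertz associated with $P_{n+1}(z)$, and $t_s$ is the speech sampling time interval.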

Likewise, to show that all roots of $Q_{n+1}(z)$ lie along the unit circle in the z-plane, $A_n(z)$
expressed by Eq. (1) is substituted for $A_n(z)$ in Eq. (4). Thus,

$$Q_{n+1}(z) = 1 + c_1 z^{-1} + c_2 z^{-2} + \dots + c_{n/2} z^{-n/2} + c_{n/2} z^{-(n/2+1)} + \dots + c_2 z^{-(n-1)} + c_1 z^{-n} + z^{-(n+1)}, \tag{11}$$

where

$$c_1 = -a_1 - a_n, \qquad c_2 = -a_2 - a_{n-1}, \qquad \dots, \qquad c_{n/2} = -a_{n/2} - a_{(n/2)+1}. \tag{12}$$

Since $Q_{n+1}(z)$ has an even symmetry with respect to its center coefficient, $z = -1$ is a root (this real
root is an artifact introduced in the decomposition of $A_n(z)$). Thus, we can factor Eq. (11) into

$$Q_{n+1}(z) = (1 + z^{-1})\left[1 + d_1 z^{-1} + d_2 z^{-2} + \dots + d_{(n/2)-1} z^{-((n/2)-1)} + d_{n/2} z^{-n/2} + \dots + d_2 z^{-(n-2)} + d_1 z^{-(n-1)} + z^{-n}\right], \tag{13}$$

where

$$d_1 = -1 + c_1, \qquad d_2 = 1 - c_1 + c_2, \qquad d_3 = -1 + c_1 - c_2 + c_3.$$

The quantity inside the brackets of Eq. (13) has a similar form to that of $P_{n+1}(z)$ in Eq. (8).
Thus, $Q_{n+1}(z)$ may be factored as

$$Q_{n+1}(z) = (1 + z^{-1}) \prod_{k=1}^{n/2} \left(1 + d'_k z^{-1} + z^{-2}\right), \tag{14}$$

where

$$d'_k = -2\cos(2\pi f'_k t_s), \tag{15}$$

and $f'_k$ is the kth LSF associated with $Q_{n+1}(z)$. Figure 5 is a block diagram of the LPC-analysis filter in
which filter parameters are LSFs.

[Block diagram; the output is labeled "Residual"; the section weights are $d_i = -2\cos(2\pi f_i t_s)$
and $d'_i = -2\cos(2\pi f'_i t_s)$.]

Fig. 5 - Block diagram of an nth-order LPC-analysis filter with LSFs as filter weights (n is even). This filter is functionally
identical to the LPC-analysis filters with prediction coefficients or reflection coefficients as filter weights (see the appendix). As
discussed later, LSFs are naturally ordered with $f_1 < f'_1 < f_2 < f'_2 < \dots$
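The factored forms suggest a direct way to rebuild the analysis filter from two sets of LSFs: multiply out the second-order sections, restore the real roots at z = 1 and z = -1, and take half the sum. The Python sketch below is an illustration under assumptions, not the report's own conversion procedure: the 8-kHz sampling rate, the function names, and the test filter are hypothetical, and the reference LSFs are obtained by general root finding only to supply test input.

```python
import numpy as np

FS = 8000.0            # assumed sampling rate in Hz
TS = 1.0 / FS          # speech sampling time interval t_s

def poly_from_lsfs(lsfs_hz, real_root):
    """(1 + real_root z^-1) * prod_k (1 + d_k z^-1 + z^-2), with d_k = -2 cos(2 pi f_k t_s)."""
    c = np.array([1.0, real_root])
    for f in lsfs_hz:
        d = -2.0 * np.cos(2.0 * np.pi * f * TS)
        c = np.convolve(c, [1.0, d, 1.0])
    return c

def analysis_poly_from_lsfs(f_p, f_q):
    """[P_{n+1}(z) + Q_{n+1}(z)] / 2; the z^-(n+1) terms cancel."""
    P = poly_from_lsfs(f_p, -1.0)    # difference filter, factor (1 - z^-1)
    Q = poly_from_lsfs(f_q, +1.0)    # sum filter, factor (1 + z^-1)
    return 0.5 * (P + Q)

def lsfs_from_poly(c, sign):
    """Reference LSFs in Hz by general root finding (used here only to build a test case)."""
    ext = np.append(c, 0.0)
    ang = np.angle(np.roots(ext + sign * ext[::-1]))
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)]) / (2.0 * np.pi * TS)

# Illustrative tenth-order analysis filter: five pole pairs of radius 0.9.
A_true = np.array([1.0])
for th in 2 * np.pi * np.array([400, 1200, 2000, 2800, 3600]) * TS:
    A_true = np.convolve(A_true, [1.0, -1.8 * np.cos(th), 0.81])

f_p = lsfs_from_poly(A_true, -1.0)   # LSFs of the difference filter
f_q = lsfs_from_poly(A_true, +1.0)   # LSFs of the sum filter
A_back = analysis_poly_from_lsfs(f_p, f_q)
```

`A_back` carries one extra trailing coefficient, which is zero; the remaining entries match `A_true` to within rounding error.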

LPC-Synthesis Filter

The LPC-synthesis filter is the inverse of the LPC-analysis filter. Thus, the transfer function of
the LPC-synthesis filter, $H_n(z)$, is

$$H_n(z) = \frac{1}{\tfrac{1}{2}\left[P_{n+1}(z) + Q_{n+1}(z)\right]} \tag{16}$$

$$= \frac{1}{1 - \left\{\tfrac{1}{2}\left[1 - P_{n+1}(z)\right] + \tfrac{1}{2}\left[1 - Q_{n+1}(z)\right]\right\}}. \tag{17}$$

Equation (17) is the form of a positive feedback in which the feedback element is the quantity inside
the braces. Expressing this feedback element in terms of LSFs, we obtain

$$F(z) = \tfrac{1}{2}\left[1 - P_{n+1}(z)\right] + \tfrac{1}{2}\left[1 - Q_{n+1}(z)\right]$$

$$= \tfrac{1}{2}\left[1 - (1 - z^{-1})\prod_{k=1}^{n/2}\left(1 + d_k z^{-1} + z^{-2}\right)\right] + \tfrac{1}{2}\left[1 - (1 + z^{-1})\prod_{k=1}^{n/2}\left(1 + d'_k z^{-1} + z^{-2}\right)\right]$$

$$= \tfrac{1}{2}\left[1 - \prod_{k=1}^{n/2}\left(1 + d_k z^{-1} + z^{-2}\right) + z^{-1}\prod_{k=1}^{n/2}\left(1 + d_k z^{-1} + z^{-2}\right)\right] + \tfrac{1}{2}\left[1 - \prod_{k=1}^{n/2}\left(1 + d'_k z^{-1} + z^{-2}\right) - z^{-1}\prod_{k=1}^{n/2}\left(1 + d'_k z^{-1} + z^{-2}\right)\right]. \tag{18}$$

In Eq. (18), both the second and fourth terms may be realized by cascading second-order filters
identical to those appearing in the LPC-analysis filter shown by Fig. 5. The first and third terms (which
do not appear in the LPC-analysis filter) may be implemented more economically if the following
recursive relationship is exploited. Let $G_k(z)$ represent either the first or the third term of Eq. (18)
taken over the first $k$ second-order sections. Thus,

$$G_k(z) = 1 - \prod_{i=1}^{k}\left(1 + d_i z^{-1} + z^{-2}\right), \qquad k \le n/2. \tag{19}$$

$G_k(z)$ in terms of $G_{k-1}(z)$ may be expressed as

$$G_k(z) = -z^{-1}\left(d_k + z^{-1}\right)\prod_{i=1}^{k-1}\left(1 + d_i z^{-1} + z^{-2}\right) + G_{k-1}(z). \tag{20}$$

Note that the first term of Eq. (20) without $z^{-1}$ is available from each section of the second-order filter
(output of the first summer in each of the heavy-lined boxes in Fig. 6). Thus, the total operations
required to compute either the first or third terms of Eq. (18) are n/2 summations. The LPC-synthesis
filter is shown in Fig. 6.
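The recursion of Eq. (20) follows from the definition in Eq. (19) in three lines. The shorthand $\Pi_k(z) = \prod_{i=1}^{k}(1 + d_i z^{-1} + z^{-2})$ is introduced here only for this check and is not part of the original notation; Eq. (19) is assumed to read $G_k(z) = 1 - \Pi_k(z)$.

```latex
\begin{aligned}
G_k(z) &= 1 - \left(1 + d_k z^{-1} + z^{-2}\right)\Pi_{k-1}(z) \\
       &= \left[1 - \Pi_{k-1}(z)\right] - \left(d_k z^{-1} + z^{-2}\right)\Pi_{k-1}(z) \\
       &= G_{k-1}(z) - z^{-1}\left(d_k + z^{-1}\right)\Pi_{k-1}(z).
\end{aligned}
```

The recursion starts from $G_0(z) = 0$, since the empty product $\Pi_0(z)$ equals 1. Each step reuses $\Pi_{k-1}(z)$, which the cascade of second-order sections has already produced.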


Fig. 6 - An nth-order LPC-synthesis filter with LSFs as filter weights (n is even). This filter is functionally identical to the
LPC-synthesis filters with prediction coefficients or reflection coefficients as filter weights (see the appendix).


FILTER  PARAMETER  TRANSFORMATIONS 

As  indicated  in  Fig.  3,  we  can  transform  a  set  of  prediction  coefficients  into  a  set  of  LSFs,  and 
vice  versa.  This  section  presents  these  conversion  algorithms. 

Conversion  of  Prediction  Coefficients  to  LSFs 

Conversion of prediction coefficients to LSFs is, in essence, finding roots of the difference and
sum filters expressed previously as Eqs. (3) and (4):

$$P_{n+1}(z) = A_n(z) - z^{-(n+1)} A_n(z^{-1}) \quad \text{(difference filter)}, \tag{3}$$

$$Q_{n+1}(z) = A_n(z) + z^{-(n+1)} A_n(z^{-1}) \quad \text{(sum filter)}. \tag{4}$$

Since the roots of either $P_{n+1}(z)$ or $Q_{n+1}(z)$ lie only along the unit circle of the z-plane, the use of
generalized root-finding procedures is more than what is required for the present application. A preferred
approach is a numerical method for determining the frequencies which make the amplitude response of
either the difference or sum filter zero. In this approach, both $P_{n+1}(z)$ and $Q_{n+1}(z)$ are evaluated for
various values of $z$ (where $z$ is $\exp(j 2\pi f t_s)$ and $j = \sqrt{-1}$) from a frequency of 0 Hz to the upper
cutoff frequency. This form of computation is systematic; hence, the necessary software is relatively
compact. Furthermore, the maximum amount of computation needed to find all the LSFs is fixed, contrary
to those methods which rely on the convergence of the solution (viz., Newton's method). Information
regarding the maximum number of computational steps is important for the implementation of the
algorithm in a real-time voice processor.
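A fixed-step scan of this kind might look like the Python sketch below. It is a simplified stand-in rather than the procedure detailed in this report: instead of locating minima of a power spectrum, it removes the linear-phase term from each symmetric or antisymmetric filter, which leaves a real-valued response, and records the grid intervals in which that response changes sign. The 8-kHz sampling rate, the 10-Hz step, the function names, and the test filter are assumptions.

```python
import numpy as np

FS = 8000.0        # assumed sampling rate in Hz, so the upper cutoff is FS / 2
STEP_HZ = 10.0     # assumed frequency step

def zero_phase_response(c, freqs_hz, odd):
    """Real-valued response of a symmetric (odd=False) or antisymmetric (odd=True) filter."""
    w = 2.0 * np.pi * np.asarray(freqs_hz) / FS
    m = np.arange(len(c))
    h = np.exp(-1j * np.outer(w, m)) @ c               # polynomial in z^-1 at z = exp(jw)
    h = h * np.exp(1j * w * (len(c) - 1) / 2.0)        # remove the linear-phase term
    return h.imag if odd else h.real

def scan_for_lsfs(c, odd):
    """Midpoints of the grid intervals in which the response changes sign."""
    freqs = np.arange(STEP_HZ, FS / 2.0, STEP_HZ)      # skips 0 Hz and the upper cutoff
    r = zero_phase_response(c, freqs, odd)
    hits = np.nonzero(np.sign(r[:-1]) * np.sign(r[1:]) < 0)[0]
    return 0.5 * (freqs[hits] + freqs[hits + 1])

# Illustrative tenth-order analysis filter: five pole pairs of radius 0.9.
A = np.array([1.0])
for f in (400.0, 1200.0, 2000.0, 2800.0, 3600.0):
    A = np.convolve(A, [1.0, -1.8 * np.cos(2 * np.pi * f / FS), 0.81])
ext = np.append(A, 0.0)
P, Q = ext - ext[::-1], ext + ext[::-1]

lsf_p = scan_for_lsfs(P, odd=True)     # difference filter
lsf_q = scan_for_lsfs(Q, odd=False)    # sum filter
```

The number of evaluations is fixed by the grid, whatever the input, and each estimate is within half a step of the true null. The scan misses a null that falls exactly on a grid point, and it cannot separate two nulls of the same filter that fall in one interval.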

If the difference and sum of two quantities contain all the necessary information, then the ratio of
the same two quantities also contains the same information. The use of the ratio filter, an alternative
approach for finding LSFs, is based on the rearranged expression for $P_{n+1}(z)$ and $Q_{n+1}(z)$:

$$P_{n+1}(z) = A_n(z)\left[1 - R_{n+1}(z)\right], \tag{21}$$

$$Q_{n+1}(z) = A_n(z)\left[1 + R_{n+1}(z)\right], \tag{22}$$

where

$$R_{n+1}(z) = \frac{z^{-(n+1)} A_n(z^{-1})}{A_n(z)} \quad \text{(ratio filter)}. \tag{23}$$

The ratio filter is an all-pass filter (i.e., a phase shifter with a flat amplitude response). When the phase
angle of the ratio filter is a multiple of $\pi$ radians, the amplitude response of the difference filter is zero.
On the other hand, when the phase angle of the ratio filter is zero (or a multiple of $2\pi$ radians), the
amplitude response of the sum filter is zero. Hence, frequencies which give rise to these two phase
angles are the LSFs. We describe in detail the computational steps for these two approaches.
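The ratio-filter idea can be tried out with the Python sketch below, which assumes Eq. (23) reads R(z) = z^-(n+1) A(1/z) / A(z). It unwraps the phase of the ratio filter on a fixed grid and interpolates the frequencies at which the phase passes through a multiple of pi. Which phase condition pairs with which filter depends on the sign and phase conventions adopted, so the sketch assumes no pairing: it assigns each crossing to whichever of the difference or sum filter has the smaller amplitude response there. The sampling rate, step size, function names, and test filter are hypothetical.

```python
import numpy as np

FS = 8000.0        # assumed sampling rate in Hz
STEP_HZ = 10.0     # assumed frequency step

def response(c, w):
    """Polynomial in z^-1 with coefficients c, evaluated at z = exp(jw)."""
    return np.exp(-1j * np.outer(w, np.arange(len(c)))) @ c

def lsfs_from_ratio_phase(A):
    """Return (LSFs of the difference filter, LSFs of the sum filter) in Hz."""
    n = len(A) - 1
    freqs = np.arange(0.0, FS / 2.0 + STEP_HZ / 2.0, STEP_HZ)
    w = 2.0 * np.pi * freqs / FS
    a = response(A, w)
    ratio = np.exp(-1j * w * (n + 1)) * np.conj(a) / a      # all-pass: |ratio| = 1
    turns = -np.unwrap(np.angle(ratio)) / np.pi             # rises from 0 to n + 1
    # Frequencies where the phase passes -pi, -2 pi, ..., -n pi (linear interpolation).
    crossings = np.interp(np.arange(1, n + 1), turns, freqs)
    # Decide which filter each crossing belongs to by checking both amplitude responses.
    ext = np.append(A, 0.0)
    P, Q = ext - ext[::-1], ext + ext[::-1]
    wc = 2.0 * np.pi * crossings / FS
    is_p = np.abs(response(P, wc)) < np.abs(response(Q, wc))
    return crossings[is_p], crossings[~is_p]

# Illustrative tenth-order analysis filter: five pole pairs of radius 0.9.
A = np.array([1.0])
for f in (400.0, 1200.0, 2000.0, 2800.0, 3600.0):
    A = np.convolve(A, [1.0, -1.8 * np.cos(2 * np.pi * f / FS), 0.81])

lsf_p, lsf_q = lsfs_from_ratio_phase(A)
```

Because the phase of a stable all-pass filter falls monotonically with frequency, a single pass over the grid finds every crossing, and the interpolation refines each estimate to well below the step size for this test filter.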

Approach  1:  Using  Amplitude  Response  of  Sum  and  Difference  Filters 

We  begin  computation  with  a  set  of  prediction  coefficients  generated  by  LPC  analysis  using  one 
frame  of  speech  samples  (approximately  100  to  200  samples). 

(a)  The  coefficients  of  the  LPC-analysis  filter  transfer  function  are  generated  from 


(i«(l)-l,  (24a) 

a„(i)  -  -  a,_i,  /  -  2, 3,  ...  ,  n  +  1,  (24b) 

where  a„(i)  is  the  /th  coefficient  of  A„(z)  and  /  is  the  /th  prediction  coefficient.  The  frequency 
response  of  A„(z)  is  illustrated  in  Figs.  7(a)  and  8(a). 

(b)  The  coefficients  of  the  difference  filter  are  generated  by 

P„+i(l)  -  1,  (25a) 

/>„+i(/)  -  a„(i)  -  a„(«  4-  3  -  /),  /  -»  2,3, . . .  ,  n  +  2,  (25b) 


where  p„+i(i)  is  the  /th  coefficient  of  P„.t.i(z).  The  frequency  response  of  P„+i(2)  is  shown  in  Figs. 
7(b)  and  8(b). 
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(d)  Croup  delay  of  ratio  Alter 

Fig.  7  —  Frequency  response  of  the  LPC-analysis  filter  and  other  derivative  Filters  containing 
LSF  information  (Example  #1).  Figure  7(a)  is  the  tenth-order  LPC-analysis-Filler-amplitude 
response  associated  with  speech  sound  /o/  in  "strong."  The  LPC-synthesis-Filter-amplitude 
response  is  identical  to  Fig.  7(a)  if  the  vertical  axis  is  labeled  as  "Gain  (dB)”  instead  of 
"Attenuation  (dB)."  LPC-synthesis-Filter-amplitude  response  is  an  all-pole  approximation  to 
the  speech-spectral  envelope.  Figure  7(b)  is  the  amplitude  response  of  the  difference  and  sum 
Fillers  as  dcFned  in  the  text  The  null  frequencies  associated  with  either  Filter  are  LSFs.  To 
sec  the  effect,  a  real  pole  has  not  been  removed  in  either  the  difference  or  sum  Filter.  The 
expression  "line  spectrum"  originated  from  the  shapes  of  these  frequency  responses.  Figure 
7(c)  is  the  phase  response  o."  the  ratio  Filter  defined  in  the  text.  Note  that  when  the  phase 
response  becomes  rr  radians,  the  amplitude  response  of  the  difference  Filter  becomes  null.  On 
the  other  hand,  when  the  phase  response  becomes  0  or  2  radians,  the  amplitude  response  of 
the  sum  Filter  becomes  null.  Figure  7(d)  is  the  group  delay  of  the  ratio  filter.  The  group 
delay  is  large  near  speech-resonant  frequencies,  and  LSFs  are  close  together. 
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(c)  Likewise,  coefficients  of  the  sum  filter  are  generated  by 

(26a) 

q„+\{l)  a„(l)  +  a^in  +  3  -  t),  / ->  2,3, . . .  ,  n  +  2,  (26b) 

where  q„+\(.l)  is  the  IXh  coefficient  of  l5„+i(2).  The  frequency  response  of  Qn+l  (z)  is  shown  in  Figs. 
7(b)  and  8(b). 

(d) Since the real root at $z = 1$ is an artifact introduced in the generation of the difference filter,
$(1 - z^{-1})$ may be factored out from $P_{n+1}(z)$ to simplify computations. Thus,

$$P_{n+1}(z) = (1 - z^{-1})\,P_n(z), \tag{27}$$

and the coefficient sequence of $P_n(z)$ is

$$p_n(1) = p_{n+1}(1), \tag{28a}$$

and

$$p_n(i) = p_{n+1}(i) + p_n(i-1), \quad i = 2, 3, \ldots, n+1. \tag{28b}$$

(e) Likewise, $(1 + z^{-1})$ may be factored out from the sum filter. Thus,

$$Q_{n+1}(z) = (1 + z^{-1})\,Q_n(z), \tag{29}$$

and the coefficient sequence of $Q_n(z)$ is

$$q_n(1) = 1, \tag{30a}$$

and

$$q_n(i) = q_{n+1}(i) - q_n(i-1), \quad i = 2, 3, \ldots, n+1. \tag{30b}$$

(f) Use of the autocorrelation sequence simplifies the amplitude spectral analysis because it needs
only the cosine transform. The autocorrelation sequence of $p_n(i)$ is obtained from

$$\phi_{pp}(j) = \sum_{i=1}^{n+2-j} p_n(i)\,p_n(i+j-1), \quad j = 1, 2, \ldots, n+2. \tag{31}$$

(g) The power spectrum of $p_n(i)$ is obtained from

$$\Phi_{pp}(k f_s) = \phi_{pp}(1) + 2 \sum_{j=1}^{n+1} \phi_{pp}(j+1) \cos(2\pi j k f_s t_s), \quad k = 1, 2, \ldots, \tag{32}$$

where $t_s$ is the speech-sampling time interval, and $f_s$ is the frequency step in hertz. The choice of
frequency step size is a major concern because a finer step size increases computation, whereas a coarser
step size introduces irreversible spectral distortions. According to intelligibility testing using DRT,
frequency steps of 10 Hz did not degrade the intelligibility of synthesized speech. Note that the above
computations are terminated upon finding $n/2$ LSFs.

(h) LSFs are the frequencies at which $\Phi_{pp}(k f_s)$ has local minima (see Figs. 7(b) or 8(b) for the
amplitude response of the difference filter). For the $n$th-order LPC-analysis filter, there will be $n/2$
LSFs (and one more at 0 Hz if step (d) is omitted, because the factor $(1 - z^{-1})$ has its root at $z = 1$).

(i) For the sum filter, steps (f) through (h) are repeated using $q_n(i)$ in place of $p_n(i)$. For the
$n$th-order LPC-analysis filter, there will again be $n/2$ LSFs (and one more at the upper-cutoff frequency
if step (e) is omitted, because the factor $(1 + z^{-1})$ has its root at $z = -1$). Note that, because the
roots of the two filters interlace starting from the difference-filter root at 0 Hz, each LSF associated with
the sum filter lies below the corresponding LSF associated with the difference filter.
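
Steps (a) through (i) can be sketched in a few lines of Python. This is an illustrative sketch, not the report's implementation: the function name, the 8-kHz sampling rate, and the full-band scan (instead of stopping after $n/2$ minima) are assumptions, and the difference-filter coefficients are taken to be $a_n(i) - a_n(n+3-i)$ by analogy with Eq. (26b).

```python
import math

def lsfs_from_power_spectrum(alpha, fs_hz=8000.0, step_hz=10.0):
    """Return (difference-filter LSFs, sum-filter LSFs) in Hz on a step_hz grid."""
    n = len(alpha)                       # LPC order, assumed even
    # 1-based a_n(i): a_n(1) = 1, a_n(i+1) = -alpha_i, and a_n(n+2) = 0.
    a = [0.0, 1.0] + [-x for x in alpha] + [0.0]
    # Difference and sum coefficients for i = 1 .. n+2 (Eq. (26) and its analog).
    p1 = [0.0, 1.0] + [a[i] - a[n + 3 - i] for i in range(2, n + 3)]
    q1 = [0.0, 1.0] + [a[i] + a[n + 3 - i] for i in range(2, n + 3)]
    # Steps (d) and (e): divide out (1 - z^-1) and (1 + z^-1), Eqs. (28) and (30).
    p, q = [0.0, p1[1]], [0.0, 1.0]
    for i in range(2, n + 2):
        p.append(p1[i] + p[i - 1])
        q.append(q1[i] - q[i - 1])

    def grid_minima(c):
        m = len(c) - 1                   # number of coefficients, n + 1
        # Step (f): autocorrelation phi[j], j = 1 .. m (Eq. (31)).
        phi = [0.0] + [sum(c[i] * c[i + j - 1] for i in range(1, m + 2 - j))
                       for j in range(1, m + 1)]

        # Step (g): cosine transform of the autocorrelation (Eq. (32)).
        def power(f_hz):
            w = 2.0 * math.pi * f_hz / fs_hz
            return phi[1] + 2.0 * sum(phi[j + 1] * math.cos(j * w)
                                      for j in range(1, m))

        freqs = [k * step_hz for k in range(int(fs_hz / (2.0 * step_hz)) + 1)]
        s = [power(f) for f in freqs]
        # Step (h): grid points that are local minima of the power spectrum.
        return [freqs[k] for k in range(1, len(s) - 1)
                if s[k] < s[k - 1] and s[k] <= s[k + 1]]

    return grid_minima(p), grid_minima(q)
```

For example, `lsfs_from_power_spectrum([1.0, -0.5])` scans a second-order filter and returns one grid frequency per filter; the resolution is limited to the step size, which is the limitation that the second approach removes by interpolation.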
Approach 2: Using Phase Response of Ratio Filter

The most significant difference between this approach and the previous one is that the spectral
analysis is performed only once, rather than twice as in the previous approach. This approach, however,
requires both the cosine and sine transforms, whereas the previous approach needs only the cosine
transform. Furthermore, this approach requires inverse-tangent operations. Thus, there is no advantage
from a computational point of view. This approach, however, allows interpolation of LSFs for
finer resolution, which was impossible in the previous one. Most important, this method provides
more readily the group delay of the ratio filter. As will be shown, the group delay at an LSF is directly
related to the spectral sensitivity of that particular LSF.


As noted from Eq. (23), the phase spectrum of the ratio filter needs the phase spectrum of the
LPC-analysis filter, $A_n(z)$. The complex spectrum of the LPC-analysis filter is

$$A_n(k f_s) = \sum_{i=0}^{n} a_n(i+1) \exp(-j 2\pi i k f_s t_s), \quad j = \sqrt{-1}, \tag{33}$$

where $a_n(i)$ is the $i$th coefficient of the $n$th-order LPC-analysis filter. Since $a_n(1) = 1$ and
$a_n(i+1) = -\alpha_i$, where $\alpha_i$ is the $i$th prediction coefficient, we can write Eq. (33) as

$$A_n(k f_s) = \left[ 1 - \sum_{i=1}^{n} \alpha_i \cos(2\pi i k f_s t_s) \right] + j \left[ \sum_{i=1}^{n} \alpha_i \sin(2\pi i k f_s t_s) \right]. \tag{34}$$

From Eqs. (23) and (34), the phase spectrum of the ratio filter, denoted by $\phi(k f_s)$, is

$$\phi(k f_s) = -(n+1)(2\pi k f_s t_s) - 2 \tan^{-1} \left[ \frac{\sum_{i=1}^{n} \alpha_i \sin(2\pi i k f_s t_s)}{1 - \sum_{i=1}^{n} \alpha_i \cos(2\pi i k f_s t_s)} \right], \quad k = 1, 2, \ldots \tag{35}$$

Figures 7(c) and 8(c) illustrate the phase spectrum of the ratio filter computed by Eq. (35). As stated
previously, LSFs are the frequencies which give rise to a phase value of a multiple of either $-\pi$ or $-2\pi$
radians. Since the phase spectrum is a monotonically decreasing function as frequency increases, LSFs
can be interpolated for a finer resolution. In Eq. (35), the frequency step is 10 Hz as in the previous
approach. The above computation is terminated upon finding $n$ LSFs.


Given a set of prediction coefficients, the use of Eq. (35) is the most direct way of computing
the phase spectrum of the ratio filter. As shown later in Eq. (45), the phase spectrum of the ratio filter can
also be computed (although not as conveniently) from a set of roots of the LPC-analysis filter.
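
The phase-crossing search of this approach can be sketched as follows. The function names, the 8-kHz sampling rate, and the simple unwrapping of the inverse tangent are assumptions made for illustration; the phase itself follows the form of Eq. (35), and each LSF is placed by linear interpolation between the two grid points that bracket a multiple of $-\pi$.

```python
import math

def ratio_filter_phase(alpha, f_hz, fs_hz=8000.0):
    """Phase of the ratio filter at one frequency, in the form of Eq. (35)."""
    n = len(alpha)
    w = 2.0 * math.pi * f_hz / fs_hz     # plays the role of 2*pi*k*f_s*t_s
    num = sum(a * math.sin((i + 1) * w) for i, a in enumerate(alpha))
    den = 1.0 - sum(a * math.cos((i + 1) * w) for i, a in enumerate(alpha))
    return -(n + 1) * w - 2.0 * math.atan2(num, den)

def lsfs_from_phase(alpha, fs_hz=8000.0, step_hz=10.0):
    """Scan upward in frequency and interpolate the crossings of -pi, -2pi, ..."""
    n = len(alpha)
    lsfs, target = [], -math.pi
    prev_f, prev_phi = 0.0, ratio_filter_phase(alpha, 0.0, fs_hz)
    k = 1
    while len(lsfs) < n and k * step_hz <= fs_hz / 2.0:
        f = k * step_hz
        phi = ratio_filter_phase(alpha, f, fs_hz)
        # Undo 4*pi jumps caused by the principal value of the inverse tangent.
        while phi - prev_phi > 2.0 * math.pi:
            phi -= 4.0 * math.pi
        while phi - prev_phi < -2.0 * math.pi:
            phi += 4.0 * math.pi
        # The phase decreases monotonically, so each target is crossed once.
        while len(lsfs) < n and phi <= target:
            lsfs.append(prev_f + (target - prev_phi) * (f - prev_f) / (phi - prev_phi))
            target -= math.pi
        prev_f, prev_phi = f, phi
        k += 1
    return lsfs                          # search stops after n LSFs
```

Crossings of odd multiples of $-\pi$ are nulls of the sum filter and crossings of multiples of $-2\pi$ are nulls of the difference filter, so the returned list alternates between the two filters.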


Conversion  of  LSFs  to  Prediction  Coefficients 


We can convert LSFs to prediction coefficients by solving for the coefficients of the polynomial
which represents the transfer function of the LPC-analysis filter, $A_n(z)$, in terms of LSFs. To begin
with, $A_n(z)$ can be expressed as a sum of two factored terms by substituting Eqs. (9) and (15) into Eq.
(5). Thus,

$$A_n(z) = \frac{1}{2} \left[ (1 - z^{-1}) \prod_{k=1}^{n/2} (1 + d_k z^{-1} + z^{-2}) + (1 + z^{-1}) \prod_{k=1}^{n/2} (1 + d'_k z^{-1} + z^{-2}) \right], \tag{36}$$

where $d_k$ and $d'_k$ denote the middle coefficients of the second-order sections of the difference and sum
filters, each equal to $-2\cos(\cdot)$ of the angle of the corresponding LSF on the unit circle.

When the product terms are multiplied out, the resulting polynomial is in the form

$$A_n(z) = 1 + \beta_1 z^{-1} + \beta_2 z^{-2} + \cdots + \beta_n z^{-n}. \tag{37}$$

Comparing Eq. (37) term by term with Eq. (1) indicates that the $i$th prediction coefficient is $-\beta_i$ for
$i = 1, 2, \ldots, n$.

The coefficients of Eq. (37), hence the solution for the prediction coefficients, can be obtained
numerically by computing the $(n + 1)$ samples from the LPC-analysis filter under the excitation of a
single pulse (i.e., 1 followed by $n$ zeros). The number of computational steps needed for
this solution, as determined from either Eq. (36) or the block diagram of the LPC-analysis filter shown
in Fig. 5, is $n(n + 1)$ multiplications and $(2n + 3)(n + 1)$ summations as listed in Table 1.
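
As an illustration of Eqs. (36) and (37), the sketch below multiplies out the second-order sections directly. It is a substitute for the impulse-response computation described above and does not reproduce its operation count; the function names and the 8-kHz sampling rate are assumptions.

```python
import math

def poly_mul(x, y):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out

def lsf_to_prediction(diff_lsfs_hz, sum_lsfs_hz, fs_hz=8000.0):
    """Return prediction coefficients alpha_1 .. alpha_n from the two LSF sets."""
    p = [1.0, -1.0]                      # (1 - z^-1), difference filter
    q = [1.0, 1.0]                       # (1 + z^-1), sum filter
    for f in diff_lsfs_hz:
        p = poly_mul(p, [1.0, -2.0 * math.cos(2.0 * math.pi * f / fs_hz), 1.0])
    for f in sum_lsfs_hz:
        q = poly_mul(q, [1.0, -2.0 * math.cos(2.0 * math.pi * f / fs_hz), 1.0])
    n = len(diff_lsfs_hz) + len(sum_lsfs_hz)
    # Eq. (36): A_n(z) is half the sum; the z^-(n+1) terms cancel.
    beta = [0.5 * (p[i] + q[i]) for i in range(n + 1)]
    # Eq. (37) versus Eq. (1): alpha_i = -beta_i.
    return [-b for b in beta[1:]]
```

Both input lists are expected to hold $n/2$ frequencies each, with the extraneous roots at 0 Hz and the upper cutoff frequency supplied by the first-order factors.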


Table 1 - Total Number of Arithmetic Operations Required for
Speech Synthesis and Parameter Conversion

                   To Synthesize Speech Samples                         To Convert LSFs to
                   (Arithmetic Operations per Sample)                   Prediction Coefficients
                                                                        (Arithmetic Operations
                   Using Prediction    Using LSFs    Differential       per Frame*)
                   Coefficients        (Fig. 6)
                   (Fig. A1)

Multiplications    n                   n             0                  n(n + 1)

Summations         n                   3n + 2        2n + 2             (2n + 3)(n + 1)

*Frame is usually 160 to 200 samples.


This kind of parameter conversion reduces the number of computational steps in the speech-synthesis
algorithm. In fact, similar parameter conversions are currently being performed in some narrowband
LPCs in which received reflection coefficients are converted to prediction coefficients prior to
speech synthesis. Table 1 lists the total number of arithmetic operations necessary to synthesize one
speech sample by using two different filter coefficients: one using prediction coefficients and another
using LSFs. The difference is $(2n + 2)$ summations per speech sample in favor of using prediction
coefficients.

As  a  numerical  example,  let  n  be  10  (i.e.,  LPC-synthesis  filter  with  10  filter  weights),  and  frame 
size  be  180  samples.  If  the  prediction  coefficients  are  used  in  lieu  of  LSFs,  3960  summing  operations 
are  saved  for  each  frame.  On  the  other  hand,  parameter  conversion  requires  110  multiplications  and 
253  summations,  but  this  conversion  is  needed  only  once  per  frame.  Thus,  converting  LSFs  to  predic¬ 
tion  coefficients  is  beneficial  for  reducing  computational  steps  during  speech  synthesis. 
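
The arithmetic of this example follows directly from the entries of Table 1 and can be checked with a short calculation (the function name is hypothetical):

```python
def conversion_tradeoff(n=10, frame=180):
    """Operation counts per frame taken from Table 1."""
    saved_sums = (2 * n + 2) * frame       # summations saved during synthesis
    conv_mults = n * (n + 1)               # multiplications for the conversion
    conv_sums = (2 * n + 3) * (n + 1)      # summations for the conversion
    return saved_sums, conv_mults, conv_sums

print(conversion_tradeoff())               # (3960, 110, 253)
```

The conversion cost is paid once per frame, whereas the saving accrues on every sample, which is why the saving dominates for any realistic frame size.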

PROPERTIES  OF  LSFs 

This  section  discusses  several  properties  associated  with  LSFs.  Spectral  sensitivity  to  LSFs  is  dis¬ 
cussed  in  a  separate  section  because  of  the  importance  this  has  for  influencing  filter-parameter  quanti¬ 
zation. 

Naturally  Ordered  Frequency  Indices 

For the $n$th-order LPC-analysis filter, there are always $(n + 2)$ LSFs (two are extraneous
frequencies which do not contain speech information (Table 2)). Furthermore, the $(n + 2)$ LSFs are
naturally ordered such that the $i$th LSF at one instant of time remains the $i$th LSF at another instant
of time. Therefore, trajectories of LSFs are continuous, and they do not intersect each other, as shown
in Fig. 9.
To prove that the LSFs are naturally ordered is equivalent to proving that the phase angle of the
ratio filter is a monotonic function of frequency. As discussed previously in connection with the
filter-parameter transformation from prediction coefficients to LSFs, LSFs are those frequencies which give
rise to phase angles of the ratio filter of $0, -\pi, -2\pi, -3\pi$, etc. If the phase response of a filter is a
monotonic function of frequency, then its first derivative with respect to frequency does not change its
sign over the entire frequency domain.


To  prove  the  above  statement,  the  following  three  quantities  are  obtained  from  the  transfer  func¬ 
tion  of  the  ratio  filter:  frequency  response,  phase  response,  and  group  delay  (i.e.,  first  derivative  of 
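phase response). From Eq. (23), the transfer function of the ratio filter is

$$R_{n+1}(z) = \frac{z^{-(n+1)} A_n(z^{-1})}{A_n(z)} \quad \text{(ratio filter)}, \tag{23}$$

where $A_n(z)$ is the $n$th-order LPC-analysis-filter-transfer function, which may be factored as

$$A_n(z) = \prod_{k=1}^{n} (1 - z_k z^{-1}), \tag{38}$$

where $z_k$ is the $k$th root of $A_n(z)$, and $z_k$ lies inside the unit circle of the $z$-plane. Substituting Eq. (38) into
Eq. (23) yields

$$R_{n+1}(z) = z^{-(n+1)} \prod_{k=1}^{n} \frac{1 - z_k^{*} z}{1 - z_k z^{-1}} = z^{-1} \prod_{k=1}^{n} \frac{z^{-1} - z_k^{*}}{1 - z_k z^{-1}}, \tag{39}$$

where $z_k^{*}$ is the complex conjugate of $z_k$. (Because the roots of $A_n(z)$ occur in complex-conjugate
pairs, the product over $z_k$ equals the product over $z_k^{*}$.) Equation (39) may be represented as

$$R_{n+1}(z) = z^{-1} \prod_{k=1}^{n} p_k(z), \tag{40}$$

in which

$$p_k(z) = \frac{z^{-1} - z_k^{*}}{1 - z_k z^{-1}}. \tag{41}$$

In terms of a modulus and argument, the $k$th root of $A_n(z)$ is in the form of

$$z_k = r_k e^{j \omega_k t_s}. \tag{42}$$

By substituting $\exp(j \omega t_s)$ for $z$ and Eq. (42) into Eq. (41), the complex-frequency response of each
factor of the ratio filter is obtained. Thus,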


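$$p_k(\omega) = e^{-j \omega t_s}\, \frac{1 - r_k e^{j(\omega - \omega_k) t_s}}{1 - r_k e^{-j(\omega - \omega_k) t_s}}. \tag{43}$$

A complex spectrum can be written in the form

$$p_k(\omega) = A_k(\omega)\, e^{j \phi_k(\omega)}. \tag{44}$$

The phase spectrum, $\phi_k(\omega)$, is the imaginary part of the logarithm of the complex spectrum. Thus,

$$\phi_k(\omega) = \operatorname{Im}[\ln(p_k(\omega))]. \tag{45}$$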
The group delay, denoted by $D_k(\omega)$, is by definition

$$D_k(\omega) = -\frac{\partial}{\partial \omega} \phi_k(\omega) = -\operatorname{Im}\left[ \frac{\partial}{\partial \omega} \ln(p_k(\omega)) \right]. \tag{46}$$

Substituting Eq. (43) into Eq. (46) yields

$$D_k(\omega) = t_s\, \frac{1 - r_k^2}{1 - 2 r_k \cos[(\omega - \omega_k) t_s] + r_k^2}. \tag{47}$$

From Eq. (40), the total group delay is a sum of all partial delays and an additional one-sample delay.
Thus, total group delay, denoted by $D(\omega)$, is

$$D(\omega) = t_s + t_s \sum_{k=1}^{n} \frac{1 - r_k^2}{1 - 2 r_k \cos[(\omega - \omega_k) t_s] + r_k^2}. \tag{48}$$

If all roots of the LPC-analysis filter lie inside the unit circle of the $z$-plane (i.e., $r_k < 1$ for all
$k$'s), the group delay expressed by Eq. (48) is positive for all frequencies regardless of the value of $\omega_k$
(i.e., arguments of the roots of $A_n(z)$, or speech-resonant frequencies). Hence, the phase response of
the ratio filter is a monotonic function of frequency. As a result, the phase angles of $0, -\pi, -2\pi, -3\pi$,
etc. appear in sequence as frequency increases. Thus, all LSFs are naturally ordered.
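
The positivity argument can be checked numerically. The sketch below evaluates the closed form of Eq. (48) for a given set of roots and compares it with a finite-difference derivative of the ratio-filter phase. The function names are assumptions, frequency is expressed as the normalized angle $\omega t_s$, and the roots are assumed to occur in complex-conjugate pairs.

```python
import cmath
import math

def group_delay(roots, w, ts=1.0):
    """Total group delay of the ratio filter, Eq. (48); w is omega * t_s."""
    d = ts                                   # the additional one-sample delay
    for zk in roots:
        r, wk = abs(zk), cmath.phase(zk)     # modulus and argument, Eq. (42)
        d += ts * (1.0 - r * r) / (1.0 - 2.0 * r * math.cos(w - wk) + r * r)
    return d

def ratio_response(roots, w):
    """Ratio filter on the unit circle, Eq. (23), for conjugate-paired roots."""
    a = 1.0 + 0.0j
    for zk in roots:
        a *= 1.0 - zk * cmath.exp(-1j * w)   # A_n(z), Eq. (38)
    # With real coefficients, A_n(1/z) is the conjugate of A_n(z) on |z| = 1.
    return cmath.exp(-1j * w * (len(roots) + 1)) * a.conjugate() / a

def numeric_group_delay(roots, w, h=1e-5):
    """Minus the central-difference derivative of the phase."""
    step = ratio_response(roots, w + h) / ratio_response(roots, w - h)
    return -cmath.phase(step) / (2.0 * h)
```

Each term of the sum is bounded below by $(1 - r_k)/(1 + r_k)$, which is positive whenever $r_k < 1$, so the delay returned by `group_delay` stays positive at every frequency.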


It  is  worthwhile  to  mention  that  the  ratio  filter  expressed  by  Eq.  (23)  or  (39)  has  application 
beyond  computing  LSFs.  It  has  been  used  as  a  mapping  function  to  transform  one  filter  to  another 
(viz.,  low-pass  filter  to  high-pass  filter,  or  low-pass  filter  to  bandpass  filter)  [14,15].  In  this  application, 
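the new variable may be defined as

$$\hat{z}^{-1} = \frac{z^{-1} - z_k^{*}}{1 - z_k z^{-1}}, \tag{49a}$$

or

$$\hat{z}^{-1} = \prod_{k} \frac{z^{-1} - z_k^{*}}{1 - z_k z^{-1}}. \tag{49b}$$

This transformation maps the unit circle onto itself because

$$|\hat{z}| \begin{cases} < 1 & \text{for } |z| < 1, \\ = 1 & \text{for } |z| = 1, \\ > 1 & \text{for } |z| > 1. \end{cases} \tag{50}$$

In other words, the unit circle is invariant under this transformation.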
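
The invariance of the unit circle under the single-factor mapping can be verified numerically. In this small sketch the function name and the sample root are assumptions chosen for illustration.

```python
import cmath

def allpass_map(z, zk):
    """Return the new variable for a given z under the single-factor mapping."""
    z_inv = 1.0 / z
    new_inv = (z_inv - zk.conjugate()) / (1.0 - zk * z_inv)
    return 1.0 / new_inv

zk = 0.5 + 0.3j                              # hypothetical root inside the unit circle
for radius in (0.8, 1.0, 1.25):
    z = radius * cmath.exp(1j * 0.7)
    print(radius, abs(allpass_map(z, zk)))   # modulus stays on the same side of 1
```

Points inside the unit circle stay inside, points outside stay outside, and points on the circle remain on it, as long as the root used in the mapping lies inside the unit circle.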


Evenly  Spaced  Frequencies  with  Flat-Input  Spectrum 

If the input signal has a flat-amplitude spectrum for the entire passband, the resulting LSFs are
evenly spaced, as illustrated in the far-right end of Fig. 9. For such an input signal, the transfer function
of the LPC-analysis filter is unity (which means that the prediction residual is identical to the
input). It follows from Eq. (38) that the moduli of all roots are zero (i.e., $r_k = 0$ for all $k$'s). Hence,
the group delay as obtained from Eq. (48) for this case is

$$D(\omega) = (n + 1)\, t_s, \tag{51}$$

and the corresponding phase response is

$$\phi(\omega) = -(n + 1)\, \omega t_s + \text{constant}. \tag{52}$$


Since the phase response is a linear function of frequency, phase angles of $0, -\pi, -2\pi, -3\pi, \ldots$
occur at a fixed-frequency interval. Since LSFs are those frequencies associated with these phase
angles, LSFs are evenly spaced. As noted from Eq. (52), the phase angle at 0 Hz is 0 radian, and the
phase angle at the upper cutoff frequency is $-(n + 1)\pi$ radians. Since LSFs occur at multiples of $\pi$
radians, the frequency separation between two adjacent LSFs is the upper cutoff frequency divided by
$(n + 1)$.


This result may be obtained alternatively from the transfer functions of the difference and sum
filters, $P_{n+1}(z)$ and $Q_{n+1}(z)$. For an input signal having a flat-amplitude spectrum over the entire
passband, its autocorrelation coefficients are zero except for a delay of zero (i.e., an impulse function in
the delay domain). Thus, prediction coefficients are zeros. Hence, as stated previously, the transfer
function of the LPC-analysis filter is unity (i.e., $A_n(z) = 1$). Thus, transfer functions of the difference
and sum filters, as obtained from Eqs. (3) and (4), are

$$P_{n+1}(z) = 1 - z^{-(n+1)} \tag{53a}$$

and

$$Q_{n+1}(z) = 1 + z^{-(n+1)}. \tag{53b}$$

The solutions for these polynomials are well known because they are often quoted problems in
complex-variable courses. On the upper half of the unit circle, the roots of these polynomials are:

$$\text{for } P_{n+1}(z): \quad z_k = \exp\left[ \frac{j \pi k}{n + 1} \right], \quad k = 0, 2, 4, \ldots, n, \tag{54a}$$

$$\text{for } Q_{n+1}(z): \quad z_k = \exp\left[ \frac{j \pi k}{n + 1} \right], \quad k = 1, 3, 5, \ldots, n + 1, \tag{54b}$$

where $j = \sqrt{-1}$. Thus, roots of the transfer functions of the difference and sum filters are interlaced,
and the angular distance between any two adjacent interlaced roots (i.e., LSFs) is equal for the flat-input
spectrum.
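
A quick numerical check of the flat-spectrum case is sketched below. The function names and the 4-kHz upper cutoff frequency are assumptions; the root angles are taken as multiples of $\pi/(n+1)$, even multiples for the difference filter and odd multiples for the sum filter.

```python
import cmath
import math

def flat_spectrum_lsfs(n=10, cutoff_hz=4000.0):
    """Root frequencies of 1 - z^-(n+1) and 1 + z^-(n+1) between 0 Hz and cutoff."""
    step = cutoff_hz / (n + 1)
    diff = [k * step for k in range(0, n + 1, 2)]   # includes the root at 0 Hz
    summ = [k * step for k in range(1, n + 2, 2)]   # includes the root at cutoff
    return diff, summ

def residual(f_hz, sign, n=10, cutoff_hz=4000.0):
    """|1 + sign * z^-(n+1)| evaluated on the unit circle at f_hz."""
    w = math.pi * f_hz / cutoff_hz
    return abs(1.0 + sign * cmath.exp(-1j * (n + 1) * w))

diff, summ = flat_spectrum_lsfs()
merged = sorted(diff + summ)
gaps = [b - a for a, b in zip(merged, merged[1:])]
print(len(merged), min(gaps), max(gaps))   # n + 2 frequencies with equal gaps
```

For $n = 10$ and a 4-kHz cutoff, the twelve frequencies are separated by $4000/11$ Hz, or about 364 Hz, and the two outermost ones are the extraneous roots.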


Closely  Spaced  Frequencies  Near  Input-Resonant  Frequencies 

As noted in Figs. 7 and 8, LSFs are closer together near input-resonant frequencies. This is
because the group delay of the ratio filter near the input-resonant frequencies is larger than elsewhere.

The input-resonant frequencies are reflected in the roots of the LPC-analysis-filter-transfer function,
$A_n(z)$. Let one root of $A_n(z)$ be $r_m \exp(j \omega_m t_s)$. Group delay of the ratio filter at the
input-resonant frequency, as obtained from Eq. (48), is

$$D(\omega_m) = t_s \left[ 1 + \frac{1 + r_m}{1 - r_m} + \sum_{\substack{k=1 \\ k \neq m}}^{n} \frac{1 - r_k^2}{1 - 2 r_k \cos[(\omega_m - \omega_k) t_s] + r_k^2} \right], \quad 1 \le m \le n. \tag{55}$$

When $r_m$ is near unity (as would be the case with resonant frequencies associated with vowels), the
second term of Eq. (55) contributes most of the total delay. Thus, the group delay becomes relatively
larger near speech-resonant frequencies, as is evident in Figs. 7(d) and 8(d). An increased group
delay means a reduced frequency interval over which the phase angle decreases by $\pi$ radians. Hence, the
two adjacent LSFs are closer together.
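
The link between group delay and LSF spacing can be made concrete with a rough calculation. The sketch assumes that the group delay is approximately constant over the interval between two adjacent LSFs, so the phase falls by $\pi$ over a frequency interval of $1/(2D)$ Hz when $D$ is in seconds; the function names and the 8-kHz sampling rate are also assumptions.

```python
def resonance_delay_samples(r_m):
    """First two terms of Eq. (55), in units of the sampling interval."""
    return 1.0 + (1.0 + r_m) / (1.0 - r_m)

def approx_lsf_spacing_hz(delay_samples, fs_hz=8000.0):
    """Frequency interval over which the phase falls by pi at constant delay."""
    return fs_hz / (2.0 * delay_samples)

print(approx_lsf_spacing_hz(11.0))                           # flat spectrum, n = 10
print(approx_lsf_spacing_hz(resonance_delay_samples(0.95)))  # sharp resonance
```

With a delay of $n + 1 = 11$ samples the spacing is the evenly spaced flat-spectrum value of about 364 Hz, whereas a root of modulus 0.95 alone contributes a delay of about 40 samples and narrows the local spacing to roughly 100 Hz or less.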
Frequency Distributions

Each LSF has an excursion range which is dependent on the speech, speaker, and other factors
such as the nature of the preemphasis prior to LPC analysis and low-frequency-cutoff characteristics of
the front-end-audio circuit. The magnitude of the frequency range of each individual LSF is essential
information for encoding them.

Figure 10 is a plot of LSF distributions computed from 54 male and 12 female speakers, each
uttering two sentences. For each category of speaker, a separate frequency distribution plot was made
for voiced and unvoiced frames (excluding nonvoiced or silent frames). Preemphasis is performed by a
single-zero filter having a zero at $z = 15/16$. LSFs are computed from tenth-order prediction
coefficients. Figure 10 plots only the mean value (indicated by ▼ or ▲) and the spread of 99% of samples
(indicated by a bar), rather than the probability density function, to show the major features of the
frequency distribution succinctly.


(a) From 54 male speakers, 2 sentences each


(b)  From  12  female  speakers,  2  sentences  each 


Fig.  10.  -  Distribution  of  LSFs  derived  from  tenth-order  LPC.  For  each  LSF,  the  mean  value  and  the  standard  deviation  are 
shown  for  both  voiced  and  unvoiced  frames.  LSFs  for  unvoiced  frames  are  consistently  higher  than  those  for  voiced  frames. 
Note  also  that  the  mean  values  for  the  first  two  LSFs  are  closer  together  than  the  other  adjacent  LSFs. 


According  to  Fig.  10,  there  is  no  significant  difference  between  male  and  female  voices  for  the 
sixth  through  the  tenth  LSFs  (i.e.,  for  frequencies  above  approximately  2  kHz).  Even  for  the  first 
through  the  fifth  LSFs,  the  difference  is  primarily  in  the  spread,  not  the  mean  value.  In  general,  the 
frequency  spread  is  greater  for  male  voices,  particularly  for  voiced  Fj  and  F^,  and  unvoiced  Fj.  It  is 
interesting  to  note  that  the  unvoiced  male  and  female  speech  is  virtually  identical  in  terms  of  the  LSF 
means (Table 3). The implication is that the unvoiced speech segment does not contain much cue
information related to speaker identification. Note also that the mean values lie very close to
the equally spaced LSFs which correspond to a flat spectrum.

As an alternate representation of Fig. 10, Fig. 11 plots the center and offset frequencies of LSF
pairs which are defined respectively as

$$\bar{f}_k = \tfrac{1}{2}(f'_k + f_k), \tag{56a}$$

$$\Delta f_k = \tfrac{1}{2}(f'_k - f_k), \tag{56b}$$

where $f_k$ and $f'_k$ are the $k$th LSFs which are the null frequencies of the sum and difference filters (see
Fig. 7 or 8), and $f'_k$ is numerically larger than $f_k$ as listed in Table 2. There are two striking features of
LSFs exhibited in Fig. 11. If speech is voiced, the offset frequency of the first LSF pair is much
smaller than the others. This phenomenon is also shown in the LSF trajectories shown in Fig. 9. On
the other hand, if speech is unvoiced, center frequencies are greater than those for voiced speech.
Exploitation of these two features can eliminate grotesque voicing errors in the pitch-excited
narrowband voice processor.
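
The center and offset frequencies of Eq. (56) are simple to compute from an ordered LSF list. In the sketch below the function name and the sample frequencies are hypothetical, and consecutive LSFs are assumed to be paired as (first, second), (third, fourth), and so on.

```python
def center_offset(lsfs_hz):
    """Return (center, offset) in Hz for each consecutive LSF pair, Eq. (56)."""
    pairs = zip(lsfs_hz[0::2], lsfs_hz[1::2])
    return [((lo + hi) / 2.0, (hi - lo) / 2.0) for lo, hi in pairs]

print(center_offset([300.0, 400.0, 1000.0, 1400.0]))
# [(350.0, 50.0), (1200.0, 200.0)]
```

A small offset for the first pair together with low center frequencies would point toward a voiced frame, which is the kind of cue the text suggests exploiting.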

SPECTRAL  SENSITIVITY  OF  LSFs 

The  spectral  sensitivity  of  an  LSF  is  determined  by  perturbing  the  LSF  and  finding  the  resulting 
change  in  the  log  spectrum  of  an  all-pole  filter.  Error  produced  by  quantization  of  the  LSFs  will  be 
magnified  by  spectral  sensitivity  and  appear  as  spectral  error  in  the  synthesized  speech.  We  will  show 
that  not  all  LSFs  are  equally  sensitive.  Thus,  to  best  use  available  bits  during  speech  encoding, 
spectral-sensitivity  analysis  of  the  filter  parameters  is  essential. 

Observed  Characteristics 


Spectral  distortions  caused  by  LSF  errors  are  considerably  different  from  those  created  by  reflec¬ 
tion  coefficients.  Mentioning  some  of  these  differences  is  highly  instructive. 


1. Cross-coupling - According to a previous investigation [16], spectral sensitivity in terms of
reflection coefficient is expressed simply as the logarithm of the bilinearly transformed reflection
coefficients

$$g_i = \log \frac{1 + k_i}{1 - k_i}, \quad i = 1, 2, \ldots, n, \tag{57}$$

where $k_i$ is the $i$th reflection coefficient. This is a most remarkable fact because the spectral sensitivity
of a reflection coefficient is independent of other coefficients. As evident in Figs. 12 through 14, such
is not the case with spectral error caused by an error in an LSF. Spectral error is not only dependent on
the error magnitude of one particular LSF, but it is also dependent on the frequency separations from the
others. If an LSF which is far removed from other LSFs has an error, its spectral sensitivity is small (Fig.
14). On the other hand, if the LSF with an error is in the proximity of other LSFs, then its spectral
sensitivity is large (Figs. 12 and 13). An LSF may have an error in the low-frequency region (Fig. 12)
or in the high-frequency region (Fig. 13), but the magnitudes of both spectral errors are large because
other LSFs are in the vicinity.


2.  Localized  spectral  error  —  Error  in  a  reflection  coefficient  produces  a  spectral  error  in  the  entire 
passband.  In  contrast,  an  error  in  an  LSF  produces  spectral  error  in  the  neighborhood  of  that  particular 
frequency  (Figs.  12  through  14).  This  phenomenon  is  somewhat  similar  to  that  of  the  channel  vocoder 
or  formant  vocoder. 


3.  Most  critical  filter  parameters  —  When  reflection  coefficients  are  used  as  filter  parameters  as  in 
the  current  narrowband  LPC,  the  first  four  coefficients  are  on  the  average  the  most  critical  parameters 
to  the  spectrum  because  these  coefficient  values  are  usually  large  for  most  vowels.  When  these  four 
coefficients  are  error  protected  in  the  narrowband  LPC  or  its  modulator/demodulator,  synthesized- 
speech  quality  is  satisfactory  even  with  a  5%  random-bit  error  (viz.,  DoD  Standard  Narrowband  LPC, 
and  HF  mode  of  Advanced  Narrowband  Digital  Voice  Terminal)  [17].  On  the  other  hand,  when  the 
LSFs  are  used  as  filter  parameters,  the  first  two  frequencies  are  on  the  average  the  most  critical  parame¬ 
ters  to  the  spectrum  because  their  frequency  separation  is  small  for  most  vowels  (Fig.  9).  Table  4  lists 
numerically derived, spectral-sensitivity coefficients from voiced frames for both male and female
voices.  The  first  two  LSFs  are  indeed  the  most  sensitive. 
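
The cross-coupling and localization effects described in items 1 and 2 can be reproduced with a small perturbation experiment. This is a sketch under stated assumptions, not the procedure used to produce Figs. 12 through 14 or Table 4: the function names, the 8-kHz sampling rate, and the 10-Hz evaluation grid are hypothetical, and the odd-numbered LSFs (first, third, ...) are assigned to the sum filter and the even-numbered ones to the difference filter, as the factors $(1 + z^{-1})$ and $(1 - z^{-1})$ of Eqs. (29) and (27) imply.

```python
import cmath
import math

def poly_mul(x, y):
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out

def lsfs_to_analysis_filter(lsfs_hz, fs_hz=8000.0):
    """Coefficients of A_n(z) from an ordered LSF list (form of Eq. (36))."""
    p, q = [1.0, -1.0], [1.0, 1.0]           # difference and sum filters
    for idx, f in enumerate(lsfs_hz):
        sec = [1.0, -2.0 * math.cos(2.0 * math.pi * f / fs_hz), 1.0]
        if idx % 2 == 0:
            q = poly_mul(q, sec)             # 1st, 3rd, ... LSF: sum filter
        else:
            p = poly_mul(p, sec)             # 2nd, 4th, ... LSF: difference filter
    return [0.5 * (p[i] + q[i]) for i in range(len(lsfs_hz) + 1)]

def amplitude_db(a, f_hz, fs_hz=8000.0):
    w = 2.0 * math.pi * f_hz / fs_hz
    resp = sum(c * cmath.exp(-1j * w * i) for i, c in enumerate(a))
    return 20.0 * math.log10(abs(resp))

def spectral_error_db(lsfs_hz, index, shift_hz=20.0, fs_hz=8000.0, step_hz=10.0):
    """Peak and rms change of the log spectrum when one LSF is shifted."""
    moved = list(lsfs_hz)
    moved[index] += shift_hz
    a0 = lsfs_to_analysis_filter(lsfs_hz, fs_hz)
    a1 = lsfs_to_analysis_filter(moved, fs_hz)
    grid = [k * step_hz for k in range(1, int(fs_hz / (2.0 * step_hz)))]
    err = [amplitude_db(a1, f, fs_hz) - amplitude_db(a0, f, fs_hz) for f in grid]
    peak = max(abs(e) for e in err)
    rms = math.sqrt(sum(e * e for e in err) / len(err))
    return peak, rms
```

For a hypothetical set such as `[300, 340, 1100, 1450, 1800, 2150, 2550, 2900, 3250, 3600]` Hz, shifting the first LSF (which has a neighbor 40 Hz away) produces a much larger peak error than shifting the sixth LSF (whose neighbors are several hundred hertz away), in line with the behavior described for Figs. 12 through 14.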
(a) Two sets of LSFs: one without error (solid lines), and the
other with a 20 Hz error in the first LSF (dotted lines)

(b) LPC-analysis-filter-amplitude responses by
the two sets of LSFs shown in (a)

(c) Difference between the two amplitude responses shown in (b)

(d) Group delay of ratio filter

Fig. 12 - Spectral error caused by the error in a single LSF (Example #1). LSFs are obtained from tenth-order
LPC analysis with preemphasis by a single-zero filter with zero at $z = 15/16$. The input speech is /i/ in "is." The
spectral error is created by perturbing the first frequency by 20 Hz as indicated in Fig. 12(a). Note that the
spectral error is concentrated near the perturbed LSF. The peak error is 5.03 dB, whereas the root-mean-square (rms)
error over the entire passband is 0.71 dB (which is not a good indicator for the peaky error). Note that the group
delay of the ratio filter at the perturbed frequency shown in Fig. 12(d) is rather large.
(b) LPC-analysis-filter-amplitude responses by
the two sets of LSFs shown in (a)

(c) Difference between the two amplitude responses shown in (b)

Fig. 13 - Spectral error caused by an error in a single LSF (Example #2). This example is identical to the
preceding example except that the ninth LSF is perturbed by 20 Hz, rather than the first LSF. As shown in Fig.
13(d), the group delay of the ratio filter at the perturbed frequency is nearly as large as that of the first LSF.
Thus, spectral error of a similar magnitude as the preceding example is expected. Computations show that the
peak error is 4.28 dB and the rms error is 0.53 dB. Note that front vowels (such as /i/ shown in this figure) have
strong upper resonant frequencies, particularly with preemphasis. Hence the spectral sensitivities of higher order
LSFs are comparable to those of the first two LSFs. Since front vowels occur less frequently than do back vowels,
the spectral sensitivity based on an ensemble of many samples (Table 4) does not indicate the phenomenon shown
in this figure.
If LSFs are computed by the use of the ratio filter (i.e., Approach 2 in the previous discussion
in this report), we can obtain the delays by simply taking the gradients of the phase angles at the
crossings of $-\pi, -2\pi, -3\pi, \ldots$ radians where the LSFs are located. Thus, group delays are essentially a
byproduct of LSF computations. (Even if only the LSFs are given, group delays can be computed by
transforming the LSFs back to prediction coefficients, and recomputing the LSFs.)

PERCEPTUAL  SENSITIVITIES 

Inaccurate  representation  of  LSFs  caused  by  quantization  errors  results  in  spectral  error  in  the 
synthesized  speech.  Spectral  error,  in  turn,  is  perceived  by  the  human  ear.  If  the  error  is  large,  it  leads 
to a misidentification of the phoneme. Inasmuch as the human ear makes the ultimate classification of
sound,  peculiarities  of  human  auditory  perception  (in  addition  to  spectral  sensitivities  of  filter  parame¬ 
ters  discussed  in  the  preceding  section)  must  be  fully  exploited  to  encode  speech  efficiently. 

Human  perception  of  spectral  distortion  is  a  complex  problem.  Equal  amounts  of  spectral  distor¬ 
tion  expressed  in  decibels  can  sound  considerably  different  to  the  listener.  For  example,  if  the  spectral 
distortion  is  stationary,  the  effect  is  much  more  bearable  to  the  listener  than  the  same  amount  of  dis¬ 
tortion  that  is  time-variant.  Likewise,  high-frequency  distortion  is  more  tolerable  to  the  listener  than 
the  same  amount  of  distortion  in  the  low-frequency  region.  Only  recently  has  parameter  quantization 
in  voice  processors  been  based  on  auditory  perception,  such  as:  masking  effect,  preference  of  the 
noise-spectral  level  below  the  speech  spectrum  (i.e.,  use  of  a  noise  shaper),  and  the  like  [18,19].  Only 
a  few  perceptual  experiments  are  presented  in  this  section;  more  experimentation  in  this  area  is  highly 
desirable. 

Perceptual  Sensitivity  to  LSF  Changes 

It  is  well  known  that  the  amount  of  frequency  variation  in  pitch  that  produces  a  perceived  just- 
noticeable  difference  (JND)  is  approximately  linear  from  0.1  to  1  kHz,  but  the  frequency  variation 
needed  to  produce  a  JND  increases  logarithmically  from  1  to  10  kHz  [20].  We  would  like  to  perform  a 
similar  experiment  with  LSFs. 

In  essence,  we  use  speech-like  sounds  rather  than  a  single  tone.  We  incrementally  change  one  of 
the  ten  LSFs  in  order  to  create  a  set  of  closely  related  sounds.  From  these  sounds,  each  listener 
decides  his  or  her  own  JND  in  terms  of  the  variation  in  one  LSF.  The  JND  is  dependent  not  only  on 
the  magnitude  of  a  frequency  shift,  but  also  the  change  of  the  spectral  amplitude.  Unfortunately,  we 
do  not  have  control  over  the  spectral  amplitude  because  it  is  implicitly  determined  by  all  the  LSFs.  As 
we  have  seen  from  Figs.  12  and  13,  spectral  change  is  particularly  large  near  two  closely  spaced  LSFs. 
To  minimize  the  effect  of  the  spectral-amplitude  change  on  the  determination  of  the  JND,  we  chose 
equispaced  initial  LSF  values. 

The  procedure  used  in  this  experiment  is  called  the  "method  of  limits"  or  "method  of  minimal 
change"  [21].  An  example  of  this  method  is  the  hearing  test  where  the  subject  determines  when  he 
first  hears  a  tone  (as  the  level  increases  or  ascends)  and  when  it  first  becomes  inaudible  (as  the  level 
decreases  or  descends).  The  average  of  these  two  values  gives  rise  to  a  threshold.  After  the  subject 
listens  to  a  number  of  the  "ascending"  and  "descending"  series  of  trials,  the  average  of  the  thresholds 
determines  an  approximate  sensitivity  for  the  listener  to  that  tone  for  the  experimental  conditions  that 
were  tested.  By  using  a  number  of  different  subjects,  a  threshold  is  determined  for  the  general  popula¬ 
tion. 


Because we would like to test the sensitivity of human perception to changes in LSFs, excitation
parameters were fixed throughout the experiment with a voiced state having a pitch frequency of 100
Hz. The experiment is broken up into ten parts corresponding to the ten LSFs. Except for the LSF
being perturbed, the remaining LSFs assume the equispaced values shown in Fig. 16. Part 1 of the
experiment deals with determining the JND in perception as the first LSF is perturbed while the
remaining LSFs (two through ten) remain constant. Part 2 of this experiment allows the second LSF to
be perturbed while the first and third-through-tenth LSFs remain fixed. The remaining parts of this
experiment are organized in a like manner.

Fig. 16 - Part 2 of perceptual experiment showing second LSF changing with time

Since the procedure for finding the JND of a perturbed LSF is identical for each LSF, we will
choose one LSF (namely, the second LSF) to describe the experiment (which we denote as part 2). In
part 2, the first and third-through-tenth LSFs are fixed as shown in Fig. 16. The listener tries to
determine the minimal reduction in frequency of the second LSF (descending case) required for a JND in
perception. To determine this minimal amount of shift, the listener hears through a headset the output
of a tape recorder having a series of computer-generated tones, each of which is one-half second in
duration with one second of silence between tones. The first half of every tone heard by the listener
is generated with equispaced LSFs, but the last half of the tone (one-fourth second in duration) has
the second LSF changing with each tone. In the descending case (from left to right as shown in Fig.
16), the listener determines at which point a JND is perceived. The ascending case (from right to left)
is performed with the listener determining when the JND in the tones is no longer heard. The average
of the two values obtained from the ascending and descending series of tones determines a threshold.
The process just described is repeated to reduce the variance. The two thresholds are averaged to arrive
at a threshold for a particular listener for the second LSF.

To arrive at statistics that represent a larger population, 16 listeners took the perception test. To
minimize the effect of learning on the results, the order in which the parts of the experiment were
played to the listeners varied. Also, half of the listeners heard the LSFs as they increased in index
while the other half heard the LSFs as they decreased in index.

The minimum shift needed for detection is known as the "absolute threshold" or absolute limen (Latin
for threshold) [21]. The absolute limen is not unique, as it varies with such factors as testing
procedures, number of listeners, and experimental setup.

The  experimental  results  are  plotted  in  Fig.  17.  As  noted,  the  JND  in  change  of  frequency  is 
directly  proportional  to  frequency.  The  difference  in  perceptual  sensitivity  of  the  first  LSF  at  364  Hz 
and  that  of  the  tenth  LSF  at  3.636  kHz  is  approximately  two  to  one  which  is  quite  similar  to  that 
obtained  by  using  a  single  tone  [20]  at  these  two  frequencies. 


Fig. 17 - JND of perturbed LSFs (horizontal axis: frequency in kHz, marked at the ten equispaced
LSF values from 0.364 to 3.636)


Sensitivity  of  DRT  Scores  to  LSFs 

The DRT is a perception test where the listener, on hearing a synthesized word, makes a choice
between two rhyming words, one of which is correct. These two words differ by an initial consonant
attribute. The DRT has been the most frequently used method for evaluating synthesized speech. All
DoD-developed voice processors have been subjected extensively to the DRT. Initially, the DRT was
used as a diagnostic means for improving the voice processor under development. Recently, however,
it has also become a means for ranking voice processors.

Certain word pairs in the DRT list (there are 192 of them) are rather difficult to discriminate; for
example, "fin" versus "thin," "sole" versus "thole," and "fad" versus "that." Even if we use unquantized
filter parameters, synthesized speech is often not clear enough for a sure discrimination (the conventional
pitch-excitation signal used in the narrowband LPC is partly to blame for this). On the other hand,
some other word pairs are not as difficult to discriminate, for example: "go" versus "Joe," "chair" versus
"care," and "dint" versus "tint." For these words, we can quantize the filter parameters considerably
more coarsely, and we can still distinguish one word of the pair from the other. In fact, choosing
between "dint" and "tint" is more dependent on a sudden change in the loudness (i.e., the excitation
signal parameters) than on the spectral content (i.e., filter parameters).

Because  the  sensitivity  of  DRT  scores  to  LSFs  is  quite  complex,  we  cannot  express  in  simple 
terms  the  relationship  between  DRT  scores  and  LSFs.  We  will,  however,  examine  a  few  special  cases 
which  are  useful  for  speech  encoding. 

Effect of Logarithmically Quantized LSFs on DRT Scores

The equitempered scale (or logarithmic scale) is a collection of frequencies in which any two
adjacent frequencies form the same ratio. The equitempered scale is not only used for tuning a musical
instrument, but it has also often been used to quantize some voice processor parameters. For example, the
fundamental pitch of the voiced-excitation signal of the narrowband LPC is quantized with an
equitempered scale having 20 steps per octave (i.e., frequency resolution of 3.5%). Thus, encoding the pitch
frequency range from 50 to 400 Hz requires only 60 discrete values (six bits for binary representation).

Likewise, formant frequencies of a formant vocoder may be encoded more efficiently if the log
scale is used [22]. The JNDs of formant frequencies are on the order of 3 to 5% although this is
greatly dependent on the proximity of the formants to one another. A frequency resolution of 3% is
equivalent to that of the equitempered scale having 24 steps per octave. On the other hand, a
frequency resolution of 5% is equivalent to having 15 steps per octave, which is slightly finer than the
chromatic scale (12 steps per octave).

As noted above, frequencies may be quantized more efficiently without compromising the
perceptual quality of the synthesized speech. Thus, we are interested in the effect of logarithmically
quantized LSFs on the DRT score. To carry out this investigation, we originally chose four
equitempered scales for frequency quantization: 24, 18, 15, and 12 steps per octave. The corresponding
frequency resolutions are approximately 3%, 4%, 5%, and 6%.
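
The correspondence between steps per octave and percentage resolution can be checked numerically. The
following is a minimal sketch, not part of the original study; the function name is ours. It relies only
on the definition given above: adjacent frequencies on an equitempered scale with n steps per octave
differ by the ratio 2^(1/n).

```python
def resolution_percent(steps_per_octave: int) -> float:
    """Percentage spacing between adjacent frequencies on an equitempered scale."""
    ratio = 2.0 ** (1.0 / steps_per_octave)   # constant ratio between neighbors
    return (ratio - 1.0) * 100.0

# Scales mentioned in the text (20 steps per octave is the 2400-b/s LPC pitch scale).
for steps in (24, 20, 18, 15, 12):
    print(f"{steps:2d} steps/octave -> {resolution_percent(steps):.2f}%")
```

The printed values round to the 3%, 3.5%, 4%, 5%, and 6% figures quoted in the text.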

First, we tested the case of a 6% frequency resolution (i.e., 12 steps per octave). For this case,
frequencies from 400 to 3200 Hz (three octaves) are quantized to 3(12) + 1 = 37 values (including
both end frequencies). Two additional frequencies below 400 Hz and three additional frequencies
above 3200 Hz cover the entire frequency range of interest. Thus, the total number of frequencies is
only 42. Hence, the number of combinations for choosing 10 frequencies out of 42 possible
frequencies is 1,471,442,973; the smallest power of two higher than this number is 2^31. Actually, the
number of combinations is much less than this figure because certain combinations do not occur with
speech (i.e., the tenth LSF would not likely be less than 2500 Hz, or the first LSF be above 1200 Hz, etc.).
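
The counting argument above can be reproduced in a few lines. This is only an illustrative check (the
variable names are ours): because the LSFs are naturally ordered, a frame corresponds to an unordered
choice of 10 distinct frequencies out of the 42 available.

```python
import math

STEPS_PER_OCTAVE = 12
in_band = 3 * STEPS_PER_OCTAVE + 1      # 400 to 3200 Hz, both ends included
n_frequencies = in_band + 2 + 3         # two extra below 400 Hz, three above 3200 Hz
n_lsfs = 10                             # LSFs per frame in a tenth-order system

combinations = math.comb(n_frequencies, n_lsfs)
bits_needed = math.ceil(math.log2(combinations))   # bits for a joint index

print(n_frequencies, combinations, bits_needed)
```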

As  listed  in  Table  5,  the  resulting  DRT  score  is  only  1.8  points  below  that  for  using  unquantized 
filter  parameters,  and  virtually  equal  to  that  of  the  2400-b/s  LPC.  Even  if  all  LSFs  are  independently 
quantized  to  33  bits  per  frame,  the  result  is  as  good  as  that  of  the  2400-b/s  LPC  which  uses  41  bits  per 
frame  for  filter  parameters.  Since  the  drop  in  DRT  score  is  so  small  with  a  6%  frequency  resolution  for 
LSFs,  we  did  not  run  DRTs  for  3%,  4%,  and  5%  frequency  resolutions. 

Table 5 - DRT Scores when LSFs Are Quantized Logarithmically. The bits
referred to are the number of bits per frame used to encode filter parameters.
For comparison, DRT scores with unquantized reflection coefficients and the
2400-b/s LPC are also listed. In all cases, the excitation signal is identical to
that used in the 2400-b/s LPC. The scores are based on three male speakers
("LL," "CH," and "RH"). The use of LSFs saves at least 8 bits per frame for
speech quality similar to 2400-b/s LPC.

(a) First-through-tenth reflection coefficients are independently quantized at 5, 5, 5, 5, 4, 4, 4, 4, 3,
and 2 bits as used in the DoD Standard 2400-b/s LPC.
(b) Maximum number of LSF combinations inherent in a 6% frequency resolution is 31 bits (see text).
(c) Each LSF is not only quantized to have a 6% frequency resolution, but each is also bounded in range.
First through tenth LSFs are represented by 4, 4, 4, 4, 4, 3, 3, 3, 2, and 2 bits.

Effect  of  LSF  Elimination  on  the  DRT  Score 

Most formant vocoders do not transmit any information on the fourth and fifth formant
frequencies. One reason is that these formants are not always present in speech; therefore, their tracking
is a major problem. But a more significant reason for not transmitting them is that speech intelligibility
is adequate using information related to the first three formant frequencies.

We are interested in finding out whether the elimination of some of the higher indexed LSFs also
results in a graceful degradation of speech intelligibility as in the formant vocoder. Higher indexed
LSFs describe the speech-spectral envelope in the high-frequency region, similar to higher formant
frequencies. We would like to know the DRT score sensitivity when the higher indexed LSFs are not
present.

When the formant vocoder uses only the first three formant frequencies, the synthesized speech
sounds somewhat dull because the speech bandwidth is only about 3 kHz (the third formant frequency
does not swing above 3 kHz too often). This is not necessarily the case for a voice processor employing
LSFs as filter parameters because the receiver can regenerate speech having the full bandwidth by
reintroducing eliminated LSFs. We can reintroduce eliminated LSFs because there is always a fixed
number of LSFs in any speech at any time. Exact locations of these high-indexed LSFs are important,
but they are not as critical as one might expect. Since all LSFs are naturally ordered, eliminated LSFs
must be reintroduced somewhere between the highest LSF transmitted and the upper cutoff frequency
(see typical LSF trajectories shown in Fig. 9).

We  may  place  each  omitted  LSF  at  an  equal  spacing  from  the  highest  LSF  transmitted  to  the 
upper  cutoff  frequency.  The  spacing  will  actually  vary  from  frame  to  frame  because  the  highest  LSF 
transmitted  will  vary.  The  synthesized  speech  sounds  similar  to  that  using  all  ten  original  LSFs.  A 
casual  listener  cannot  discern  the  difference. 
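
The equal-spacing rule just described can be sketched as follows. This is a minimal illustration under
stated assumptions, not the report's implementation: the function name is hypothetical, the upper cutoff
is taken to be 4 kHz (half of the 8-kHz sampling rate), and the cutoff itself is treated as an endpoint
that is not occupied by an LSF.

```python
def reintroduce_lsfs(transmitted, total=10, cutoff_hz=4000.0):
    """Append omitted high-index LSFs, equally spaced up to the cutoff.

    transmitted: ascending list of received LSFs in Hz.
    total:       number of LSFs the synthesis filter expects.
    cutoff_hz:   assumed upper cutoff frequency (not itself an LSF).
    """
    missing = total - len(transmitted)
    if missing <= 0:
        return list(transmitted[:total])
    highest = transmitted[-1]
    # Divide the gap into (missing + 1) equal intervals; the spacing
    # changes from frame to frame because `highest` changes.
    step = (cutoff_hz - highest) / (missing + 1)
    return list(transmitted) + [highest + step * (i + 1) for i in range(missing)]

# Example frame with the ninth and tenth LSFs omitted (values are made up).
received = [300.0, 600.0, 1000.0, 1400.0, 1800.0, 2200.0, 2600.0, 3000.0]
print(reintroduce_lsfs(received))
```

Because the inserted values always lie strictly between the highest transmitted LSF and the cutoff, the
natural ordering of the LSFs is preserved.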

DRT scores, however, do show some differences. As listed in Table 6, when the highest LSFs
(i.e., ninth and tenth LSFs) are eliminated there is a reduction of 3.3 points in the score. As two
additional LSFs are eliminated there is an additional 1.5-point drop in the score.


Table  6  —  Three  Male  DRT  Scores  in  Terms  of  Number  of 
LSFs  Eliminated  from  the  Highest  Index.  The  eliminated  LSFs 
are  substituted  with  artificially  derived  values  at  the  receiver. 
The  excitation  signal  is  identical  to  that  used  in  the  narrowband 
LPC.  The  most  significant  degradation  occurs  for  the  attribute 
"graveness,"  which  tests  "weed"  versus  "reed,"  "bid"  versus 
"did,"  and  "peek"  versus  "teak,"  among  others.  Discrimination 
of  these  word  pairs  requires  accurate  upper  formant 
frequencies.  The  scores  are  based  on  three  male  speakers 
("LL,"  "CH,"  "RH"). 


                          Number of LSFs Eliminated
Sound Classification        0        2        4

Voicing                   90.6     89.6     90.1
Nasality                  95.3     94.8     93.2
Sustention                81.0     74.7     77.6
Sibilation                90.1     90.1     83.3
Graveness                 87.0     75.0     74.2
Compactness               94.8     94.5     91.7

Total                     89.8     86.5     85.0


Instead of eliminating LSFs we can also eliminate some of the offset frequencies of LSF pairs
defined by Eq. (56b). The eliminated offset frequencies may be reintroduced at the receiver based on
their respective mean values as shown in Fig. 11. As noted from the DRT scores listed in Table 7, the
two highest offset frequencies may be omitted without degrading speech intelligibility. Thus, a
tenth-order LPC system can use only eight filter parameters.


Table  7  —  One  Male  DRT  Scores  in  Terms  of  the  Number  of 
Offset  Frequencies  of  LSF  Pairs  Eliminated  from  the  Highest 
Index.  The  eliminated  offset  frequencies  are  substituted  with 
their  statistics  shown  in  Fig.  11.  The  excitation  signal  is 
identical  to  that  used  in  the  narrowband  LPC.  The  scores  show 
that  the  two  highest  offset  frequencies  can  be  eliminated  from 
transmission  without  degrading  intelligibility.  The  score  is 
based  on  one  male  speaker  ("RH"). 


                          Number of Offset Frequencies Eliminated
Sound Classification        0        1        2        3

Voicing                   90.6     95.3     91.4     93.0
Nasality                  97.7     96.9     95.3     93.0
Sustention                76.3     79.7     78.7     77.1
Sibilation                89.8     89.1     91.4     93.0
Graveness                 83.3     76.1     90.3     74.0
Compactness               96.1     95.3     96.9     89.8

Total                     89.0     88.7     89.0     86.7

IMPLEMENTATION OF AN 800- AND 4800-b/s VOICE PROCESSOR

We can use a single voice processor to generate all three rates (800, 2400, and 4800 b/s) because
there are many computational commonalities. For example, all three rates require LPC analysis and
synthesis. The 800-b/s mode, as in the 2400-b/s LPC, needs a pitch tracker and voicing decision. With
all of these voice processing algorithms in one hardware unit, we can reduce the multiplicity of
hardware and simplify logistics.

During  the  preamble  period,  one  data  rate  is  selected  by  the  operator,  and  the  choice  of  data  rate 
is  transmitted  to  the  receiver  as  header  data.  Based  on  the  received-header  information,  the  operator 
selects  one  of  the  three  voice  processing  algorithms  stored  in  memory.  To  make  implementation 
simpler,  the  speech  waveform  is  sampled  at  a  fixed  rate  of  8  kHz,  and  the  frame  size  is  fixed  at  180 
samples.  Likewise  the  synchronization-bit  pattern  is  identical  to  that  used  in  the  2400-b/s  LPC 
(namely,  alternating  "1"  and  "0"  every  54  bits). 
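
The fixed sampling rate and frame size determine how many bits each rate can spend per frame. The short
calculation below is only an arithmetic illustration of the figures quoted above; the constant and
function names are ours.

```python
SAMPLE_RATE_HZ = 8000    # fixed speech sampling rate
FRAME_SAMPLES = 180      # fixed frame size

frame_rate_hz = SAMPLE_RATE_HZ / FRAME_SAMPLES   # about 44.44 frames per second

def bits_per_frame(data_rate_bps: int) -> float:
    """Bits available in one 180-sample frame at the given data rate."""
    return data_rate_bps * FRAME_SAMPLES / SAMPLE_RATE_HZ

for rate in (800, 2400, 4800):
    print(f"{rate} b/s -> {bits_per_frame(rate):.0f} bits per frame")
```

At 2400 b/s this gives the 54-bit frame of the standard LPC, which is why the synchronization bit recurs
every 54 bits.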

Figure 18 is a block diagram of the voice processor. As noted, there are many shared functional
blocks among the different rates. The single-lined blocks are those used in the 2400-b/s LPC, and they
do not need any further elaboration because they are well defined in Federal Standard 1015. The
hatched blocks have been discussed earlier in this report. Thus, the present discussion concentrates on
the heavy-lined blocks (encoders and decoders of the 800- and 4800-b/s voice processors). The
800-b/s voice processor may be named the "pitch-excited line-spectrum vocoder," and the 4800-b/s voice
processor may be named the "nonpitch-excited line-spectrum vocoder."

800-b/s  Encoder/Decoder 

A data rate of 800 b/s is approximately 1% of the data rate of unprocessed digitized speech.
When the data rate is compressed to this extreme limit, degradation of both speech intelligibility and
quality is inevitable. We would like to review the scope of this severe compression of speech
information, and then we will proceed carefully with the specification of an 800-b/s encoder/decoder.

Speech can be generated at an average rate of 100 b/s as demonstrated by the VOTRAX speech
synthesizer [23]. Since the VOTRAX generates speech by a set of rules, the resulting synthesized
speech does not imitate any one particular speaker. If a voice processor is designed to imitate an actual
person’s voice, the required data rate increases dramatically. For example, "Speak and Spell" (devised
by Texas Instruments (TI)) is a speech synthesizer that imitates a speaker’s voice. TI analyzed one
person’s voice (a good broadcasting voice from an announcer at a local radio station). To generate
speech data, TI segmented speech visually using a sophisticated interactive-computer system. The
speech data from each segment was repeatedly played back for evaluation. If needed, speech data from
one segment was replaced by other speech data of similar sounds in order to achieve better speech
quality. The resulting speech data was stored in "Speak and Spell" for synthesis. Even under such an ideal
analysis/synthesis condition and using only one speaker, the data rate of "Speak and Spell" varies from
600 to 2400 b/s [24] with unknown distribution. If it is symmetrically distributed about its mean, the
average data rate is 1500 b/s. We do not know what the data rate would be in "Speak and Spell" if the
number of speakers is increased, but we do know that even 2400 b/s is not sufficient for effortless
communications (see Fig. 1). This is somewhat disturbing since one out of 2^41 (2.2 trillion) spectral sets
is transmitted when speech is voiced. As will be shown, these 2.2 trillion spectral sets are reduced to
3840 in the 800-b/s voice processor. This 500,000,000-to-1 reduction of spectral information
introduces significant speech degradation, particularly in a multispeaker environment with casual
conversational speech.

Speech degradation occurs not only in conversational tests, but in the DRT scores as well.
One of the six attributes of the DRT most sensitive to the size of available spectral sets is "graveness,"
which tests the listener’s discrimination between such words as "did" versus "bid" and "weed" versus "reed."
The main spectral difference between these word pairs is in the second and third formant trajectories
[25]; this difference becomes ambiguous as the number of available spectral sets decreases. The score
for "graveness" is typically in the low 90s for 16,000-b/s voice processors and between the upper 70s and
the lower 80s for the 2400-b/s LPCs. The variability of the score is greater at the 2400-b/s data rate.
Even a slightly increased hum level in the front-end analog circuit can lower the score for "graveness"
significantly. According to available test data, the score for "graveness" at 800 b/s is further down to
somewhere between the upper 50s and the lower 70s. We note that a higher score for "graveness"
results in a higher overall DRT score because the scores for the other attributes do not degrade as
significantly with a reduction in the data rate. This points out the importance of the size of the available
spectral sets.

Fig. 18 - Block diagram of three-rate processor: (a) transmitter, (b) receiver. Single-lined blocks are
those used in DoD-standard 2400-b/s LPC which will not be discussed. The hatched blocks have been
explained earlier in this report. Heavy-lined blocks are explained in this section.

The overall speech intelligibility depends on many interrelated factors: choice of filter parameters,
method of quantization, number of available spectral sets (which is dependent on how pitch and
amplitude information are quantized), partition of voiced and unvoiced spectral sets, and most important,
exploitation of the spectral sensitivity of the speech synthesis filter and auditory perception
characteristics of the human ear. We will investigate all these areas.


Pitch  Encoder/Decoder 

To make more bits available for encoding filter parameters we will encode the pitch period as
coarsely as the ear can tolerate. But we do not contemplate using artificial or constant pitch because the
use of natural pitch is essential for making synthesized speech more acceptable to listeners. The pitch
value, however, need not be exact because there are many acceptable pitch contours for a given speech.
According to an experiment conducted by Gold and Tierney with an 8-kb/s channel vocoder, the pitch
contour may be lowered as much as 3% or raised as much as 2% without being sensed by the listener
[26]. The listener in a two-way communication link will not know the actual pitch contour because
there is no way of comparing the two versions. Therefore, the pitch contour can be
considerably off from the actual pitch contour.

It is significant to note that pitch has little influence on the DRT score. In fact, use of constant
pitch produces as good an overall DRT score as the use of natural pitch although it generates
mechanical sounding speech. Table 8 shows a comparison of DRT scores we recently obtained.


Table 8 - Effect of Constant Pitch on DRT Score for Three
Males. Pitch does not carry any information related to initial
consonant. Therefore, DRT scores are not affected by the use
of constant pitch in the 2400-b/s LPC.


                               2400-b/s LPC
Sound Classification      Normal Pitch   Constant Pitch

Voicing                       90.6            96.4
Nasality                      95.3            98.4
Sustention                    81.0            84.4
Sibilation                    90.1            88.0
Graveness                     87.0            77.6
Compactness                   94.8            95.3

Total                         89.8            89.4

Because of the low listener acceptance of constant pitch, we transmit natural pitch, but only once
per three frames. This is permissible because the rate of change of pitch is not as great as that of other
parameters. For a 2400-b/s LPC, the pitch resolution is 3.5% (i.e., 20 steps per octave). We will quantize
pitch into 5 bits, approximately 12 steps per octave from 67 to 400 Hz (Table 9). This pitch resolution
is somewhat coarser than what we desire, but this is a tradeoff made to provide more filter-parameter
sets for the 800-b/s voice processor. We do not differentially encode pitch so that the voice algorithm
responds quickly to transitions from male to female voices, and vice versa.

Table 9 - Quantized Pitch Values. For a speech-sampling
frequency of 8 kHz, a pitch period of 20 samples corresponds to a
fundamental pitch frequency of 400 Hz. On the other hand, a
pitch period of 120 samples corresponds to a pitch frequency of
66.667 Hz. As noted, pitch is approximately quantized
logarithmically at 12 steps per octave (similar to the chromatic
scale).


Pitch   Pitch      Pitch   Pitch      Pitch   Pitch
Code    Period     Code    Period     Code    Period

  0       20        12       40        24       80
  1       21        13       42        25       85
  2       22        14       44        26       90
  3       23        15       47        27       95
  4       24        16       50        28      101
  5       26        17       53        29      107
  6       28        18       57        30      113
  7       30        19       60        31      120
  8       32        20       63
  9       34        21       67
 10       36        22       71
 11       38        23       75
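
The 5-bit pitch code of Table 9 amounts to a table lookup. The sketch below copies the period values
from the table; the nearest-neighbor rule on a logarithmic scale is our assumption (the report does not
state how an arbitrary period is mapped to a code), and the function names are hypothetical.

```python
import math

# Pitch periods in samples (8-kHz sampling) for codes 0 through 31, from Table 9.
PITCH_PERIODS = [
    20, 21, 22, 23, 24, 26, 28, 30, 32, 34, 36, 38,
    40, 42, 44, 47, 50, 53, 57, 60, 63, 67, 71, 75,
    80, 85, 90, 95, 101, 107, 113, 120,
]

def encode_pitch(period_samples: float) -> int:
    """Return the code whose period is nearest on a log scale (assumed rule)."""
    return min(
        range(len(PITCH_PERIODS)),
        key=lambda code: abs(math.log(PITCH_PERIODS[code] / period_samples)),
    )

def decode_pitch(code: int) -> int:
    """Return the pitch period in samples for a 5-bit code."""
    return PITCH_PERIODS[code]

# Average density of the table: 31 steps spread over log2(120/20) octaves.
steps_per_octave = (len(PITCH_PERIODS) - 1) / math.log2(PITCH_PERIODS[-1] / PITCH_PERIODS[0])
print(f"{steps_per_octave:.2f} steps per octave")
```

Periods outside the 20-to-120-sample range are simply mapped to the nearest end of the table.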

Amplitude Information Encoder/Decoder

In addition to the pitch period discussed above, amplitude information is another nonfilter
parameter whose resolution may be made coarser to allow for more bits to encode the filter parameters.
Amplitude information is the rms value of preemphasized speech for each frame. It controls the
loudness of the synthesized speech. We will encode amplitude information into one of 16 3-dB steps
(4 bits), which is 1 bit less than in the 2400-b/s LPC. To best use the available 4 bits, we will use an
automatic gain control at the front end of the voice processor. Furthermore, we will multiply the rms
value of unvoiced speech by a factor of two prior to quantization because it is naturally lower in
comparison to that of voiced speech, then divide by a factor of two after gain calibration.
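
A 16-level, 3-dB amplitude quantizer with the unvoiced scaling described above might look like the
following. This is a sketch under assumptions: the report does not list the actual quantization levels,
so FLOOR_RMS (the rms value assigned to code 0) and the function names are hypothetical.

```python
import math

STEP_DB = 3.0      # step size stated in the text
LEVELS = 16        # 4 bits
FLOOR_RMS = 1.0    # hypothetical rms value assigned to code 0

def encode_amplitude(rms: float, voiced: bool) -> int:
    """Map a frame rms value to a 4-bit code on a 3-dB grid."""
    if not voiced:
        rms *= 2.0                       # raise unvoiced frames before quantization
    level_db = 20.0 * math.log10(max(rms, FLOOR_RMS) / FLOOR_RMS)
    return min(LEVELS - 1, round(level_db / STEP_DB))

def decode_amplitude(code: int, voiced: bool) -> float:
    """Recover the rms value, undoing the unvoiced scaling."""
    rms = FLOOR_RMS * 10.0 ** (code * STEP_DB / 20.0)
    return rms if voiced else rms / 2.0

code = encode_amplitude(100.0, voiced=True)
print(code, decode_amplitude(code, voiced=True))
```

With a uniform 3-dB grid, the reconstruction error inside the covered range is at most 1.5 dB.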

Like the pitch period encoding, we will not differentially encode amplitude information.
According to our experience, differentially encoded amplitude information results in a noticeable reduction in
"sustention," one of the six attributes of the DRT. "Sustention" tests the discriminability between "box"
versus "vox," and "thick" versus "tick," among others.

Bit  Allocation 

After bits are allocated for pitch and amplitude information, the remaining bits are assigned to
filter parameters. Since pitch is transmitted once for every three frames, it is convenient to group three
frames together although amplitude information and filter parameters are transmitted once per frame.
Since three frames are grouped into one, only 1 synchronization bit is needed for three frames. The
total number of bits per three frames at 800 b/s is 54 for a frame rate of 44.44 Hz (as used in the
DoD-narrowband LPC). Table 10 lists bits allocated for each parameter. Because of the bits saved
during the encoding of the pitch period and the amplitude information, the total number of allowable
spectral sets is as much as 2^12 (i.e., 4096). This figure is two to four times greater than that used in
some other 800-b/s voice processors.


Table 10 - Bit Allocation for
Each Parameter


Parameter                        Bits

Synchronization                    1
Pitch Period                       5
Amplitude Information             12 (a)
Filter (with voicing decision)    36 (b)

Total                             54

(a) Derived from 4 bits each from three frames
(b) Derived from 12 bits each from three frames
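
The allocation in Table 10 can be checked against the 800-b/s target with a short calculation. This is
only a consistency check; the dictionary keys and variable names are ours.

```python
SAMPLE_RATE_HZ = 8000
FRAME_SAMPLES = 180
FRAMES_PER_GROUP = 3      # pitch and sync are sent once per three frames

BITS_PER_GROUP = {
    "synchronization": 1,
    "pitch_period": 5,
    "amplitude": 4 * FRAMES_PER_GROUP,             # 4 bits in each frame
    "filter_with_voicing": 12 * FRAMES_PER_GROUP,  # 12 bits in each frame
}

total_bits = sum(BITS_PER_GROUP.values())
group_duration_s = FRAMES_PER_GROUP * FRAME_SAMPLES / SAMPLE_RATE_HZ
data_rate_bps = total_bits / group_duration_s

print(total_bits, round(data_rate_bps))
```

The three-frame group carries the same 54 bits as a single frame of the 2400-b/s LPC, which is what
allows the same synchronization pattern to be reused.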


Although the voicing decision is often encoded with a separate bit, the voiced and unvoiced
segments are neither equally probable nor equally significant. Hence, we resort to the voicing decision
being implicitly included in the filter parameter information. As we discuss in the next section, 4096
spectral sets are partitioned into voiced and unvoiced spectral sets. When a voiced spectrum is
transmitted, the pitch excitation is used for speech synthesis. On the other hand, when an unvoiced
spectrum is transmitted, random noise excitation is used for speech synthesis.

Partition  of  Voiced  and  Unvoiced  Spectral  Sets 

The allowable 4096 spectral sets are partitioned into two disjoint sets: one for voiced speech and
the other for unvoiced speech. We need fewer unvoiced spectral sets because an unvoiced spectrum
need not be represented precisely. We are accustomed to hearing a wide range of fricative spectral
variations from person to person [27]. Furthermore, we identify some fricative sounds under the influence
of formant transitions in the neighboring vocalistic segments [28]. Hence, there is a many-to-one
transform between the unvoiced speech spectrum and its perception by the human ear. The 2400-b/s
LPC exploits this phenomenon by having a ratio of allowable voiced spectral sets to unvoiced spectral
sets of 2^41 to 2^20 (i.e., 2.2 trillion to 1 million).

When 4096 spectral sets are available in an 800-b/s voice processor, the number of unvoiced
spectral sets may be somewhere around 256 (therefore, the number of voiced spectral sets is 3840).
This figure is based on our experimentation with a previous 800-b/s voice processor [29] which
quantized the reflection coefficients vectorially using the quadratic difference of the log-area-ratio as the
distance measure. According to our subsequent experimentation with LSFs as filter parameters, we have
not found any reason for changing this partition.
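
Because the voicing decision is implicit in the 12-bit spectral index, the receiver can recover it from
the index alone. The sketch below illustrates the idea; placing the voiced sets at the low indices is
our assumption (the report does not specify the layout), and the function names are hypothetical.

```python
TOTAL_SETS = 2 ** 12                        # 12 filter bits per frame
UNVOICED_SETS = 256
VOICED_SETS = TOTAL_SETS - UNVOICED_SETS    # 3840

def is_voiced(spectral_index: int) -> bool:
    """Assumed layout: indices 0..3839 are voiced, 3840..4095 unvoiced."""
    if not 0 <= spectral_index < TOTAL_SETS:
        raise ValueError("spectral index out of range")
    return spectral_index < VOICED_SETS

def excitation_for(spectral_index: int) -> str:
    """Select the synthesis excitation implied by the received index."""
    return "pitch" if is_voiced(spectral_index) else "noise"

print(VOICED_SETS, excitation_for(100), excitation_for(4000))
```

No separate voicing bit is spent; it is absorbed into the choice between the two disjoint parts of the
codebook.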

LSF  Encoder/Decoder 

An ideal filter-parameter encoder encodes all the perceptually indistinguishable spectral sets into
one of the codes; yet, none of the codes represents the spectrum of sounds unrelated to speech (such
as the crow of a rooster). Since each filter parameter set represents one distinct sound, an ideal
filter-parameter encoder has to be in the form of a block encoder (i.e., a vector quantizer) with distinct
sound spectra in the memory. The encoder compares the given speech spectrum with the stored
spectral sets, and transmits the index of the nearest neighbor based on the chosen distance measure. The
decoder reads out a set of filter parameters that corresponds to the received spectral index.
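
The encode/decode cycle of such a block encoder can be sketched as a nearest-neighbor search over a
stored codebook. This is a toy illustration, not the report's quantizer: the codebook values are made
up, the sets have four elements instead of ten, and a plain squared difference in Hz stands in for the
distance measure.

```python
def encode(lsfs, codebook):
    """Return the index of the codebook entry nearest to the input LSFs."""
    def distance(a, b):
        # Placeholder measure: unweighted squared difference in Hz.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: distance(lsfs, codebook[i]))

def decode(index, codebook):
    """Read out the stored filter parameters for a received index."""
    return codebook[index]

# Toy codebook of three 4-element LSF sets (hypothetical values, in Hz).
codebook = [
    [400.0, 1200.0, 2000.0, 3000.0],
    [300.0, 900.0, 2400.0, 3300.0],
    [500.0, 1500.0, 2500.0, 3500.0],
]

index = encode([320.0, 950.0, 2350.0, 3250.0], codebook)
print(index, decode(index, codebook))
```

Only the index crosses the channel, so the bit cost per frame is the base-2 logarithm of the codebook
size regardless of how many parameters each stored set contains.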

In practice, however, we will never be able to design such an ideal encoder as described above
because: (a) we do not have all the representative speech samples (males and females, young and old,
northerners and southerners, and normal and abnormal voices) from which allowable spectral sets are
extracted; and (b) even if we had them, we could not possibly sort out all the bogus spectra generated by a
mixture of speech sounds within the analysis window (this happens often at speech transitions).

Nevertheless,  with  some  simplifications,  approximations,  and  assumptions,  usable  vector  quantiz¬ 
ers  operating  at  a  data  rate  between  600  and  800  b/s  have  been  devised  in  recent  years.  Some  are 
based  on  the  channel  vocoder  [30,31],  others  are  based  on  the  spectral-envelope-estimation  vocoder 
[32], or the LPC [29,33]. Speech intelligibility, according to published accounts, varies from unit to
unit. But, in general, speech intelligibility at 800 b/s is about five to ten points lower than that of a
similar device operating at 2400 b/s. Thus, there is an appreciable amount of speech degradation from
2400  to  800  b/s. 

In  this  report,  we  introduce  another  800-b/s  vector  quantizer  which  is  based  on  what  we  call  the 
"line-spectrum  vocoder."  In  one  sense,  this  is  an  LPC  because  filter  parameters  are  nothing  more  than 
transformed  prediction  coefficients.  In  another  sense,  this  is  somewhat  akin  to  the  channel  vocoder, 
formant vocoder, or spectral-envelope-estimation vocoder because it uses frequencies as filter parameters,
and a change in a frequency results in a spectral change primarily near that frequency. Thus, the
line-spectrum  vocoder  combines  a  good  spectral  peak  representation  capability  of  the  LPC  and  the 
frequency-selective-quantization  property  of  the  channel,  formant,  and  spectral-envelope-estimation 
vocoders. 

Furthermore,  our  distance  measure  for  selecting  a  nearest  neighbor  spectral  set  is  not  only  based 
on  the  spectral  sensitivity  of  the  individual  LSF,  but  it  is  also  based  on  the  hearing  sensitivity  of  the 
human  ear.  Inclusion  of  hearing  sensitivity  into  the  distance  measure  makes  a  great  deal  of  sense  for 
our  vocoder  application  because  the  human  ear  makes  the  ultimate  evaluation  of  speech  quality.  Per¬ 
ceptually  motivated  distance  measures  have  been  employed  previously  by  Gold  for  the  channel  vocoder 
[31],  and  Paul  for  the  spectral-envelope-estimation  vocoder  [32]. 

Distance  Measure 


Our distance measure is expressed as the rms of the weighted LSF differences between two sets of
LSF vectors; namely, {F_a} and {F_b}, each comprising ten LSF components. Thus,

\[ d(F_a, F_b) = \sqrt{ \frac{1}{10} \sum_{i=1}^{10} \left\{ w(i) \left[ F_a(i) - F_b(i) \right] \right\}^2 } \tag{60} \]

where w(i) is the ith weighting coefficient that transforms the LSF difference to a spectral difference
which is more meaningful to our auditory perception. The weighting coefficient, w(i), is a nonnegative
number that is normalized for convenience to a value between 0 and 1.


If we are concerned only with spectral distortions, the weighting factor should be proportional
only to the spectral-sensitivity coefficient of the individual LSF (Fig. 15) because it converts LSF
differences to spectral differences. The weighting factor, normalized to have a value between 0 and 1,
would be

\[ w(i) = \frac{0.0096 \sqrt{D(i)}}{0.0096 \sqrt{D_{\max}}} = \sqrt{\frac{D(i)}{D_{\max}}} \tag{61} \]

where D(i) is the group delay (in milliseconds) of the ratio filter associated with the ith LSF for either
{F_a} or {F_b}, whichever is larger, and D_max is the maximum group delay observed from speech (i.e.,
approximately 20 ms as shown in Fig. 15).

The weighting coefficient expressed by Eq. (61) does not account for our peculiar hearing sensitivity.
We know that spectral distortions in the spectral valleys are perceptually less significant than
those near the spectral peaks. According to Flanagan [22], intensity limens for harmonic components
located in the spectral valleys can be quite large, as much as +13 dB to -∞ dB. We can incorporate
this kind of hearing insensitivity into the weighting coefficient by use of the spectral-sensitivity curve
(Fig. 15) which has been desensitized for smaller group delays. As shown in Figs. 8 and 12, the LSFs
in the spectral valleys are associated with smaller group delays.

We  would  like  to  discuss  the  values  of  smaller  group  delays.  As  we  recall,  when  the  input  signal 
has  a  flat  spectrum  without  resonant  peaks  or  nulls,  all  LSFs  are  equally  spaced.  The  group  delay  of 
the ratio filter at any LSF (in fact, anywhere in the passband for this particular case) equals 11/8000 s
or 1.375 ms, assuming a tenth-order LPC and a 4 kHz upper cutoff frequency. Smaller group delays
are referred to as group delays less than 1.375 ms, and they are associated with LSFs mainly in the
spectral  valleys. 

We  would  like  to  lower  the  spectral-sensitivity  curve  for  smaller  group  delays.  A  simple  and  satis¬ 
factory  solution  is  to  modify  the  original  spectral-sensitivity  curve  by  a  ramp  function  for  smaller  group 
delays  where  the  ramp  function  passes  through  the  origin  and  the  original  spectral-sensitivity  curve  at 
D = 1.375 ms. Since the ramp function is inscribed below the original spectral-sensitivity curve, the
LSF difference is less sensitive to the spectral error for smaller group delays. Referring to Eq. (59), the
original spectral-sensitivity curve is

\[ E = 0.0096 \sqrt{D} \quad \text{for } 0 \le D \le D_{\max}, \tag{59} \]

which  is  modified  to 


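\[ E = \begin{cases} 0.0096 \sqrt{D} & \text{for } 1.375 < D \le D_{\max} \\ \dfrac{0.0096}{\sqrt{1.375}} \, D & \text{for } D \le 1.375 \end{cases} \tag{62} \]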


Thus,  the  weighting  factor  for  the  distance  measure  which  includes  both  the  spectral  sensitivity  of 
the  individual  LSF  and  the  hearing  sensitivity  to  spectral  distortions  near  the  spectral  valleys,  as 
obtained  by  the  use  of  Eq.  (62),  is 


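\[ w(i) = \begin{cases} \sqrt{\dfrac{D(i)}{D_{\max}}} & \text{for } 1.375 < D(i) \le D_{\max} \\ \dfrac{D(i)}{\sqrt{1.375 \, D_{\max}}} & \text{for } D(i) \le 1.375 \end{cases} \tag{63} \]

where D(i) is the group delay associated with the ith LSF of either {F_a} or {F_b}, whichever is larger.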


Although the weighting factor expressed by Eq. (63) is preferred to that expressed by Eq. (61) for
vocoder application, there is still another factor that must be incorporated in the weighting function;
namely, a gradual loss of our hearing resolution with increase in frequency. Thus, a more complete
weighting function modified from Eq. (63) is in the form of


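\[ w(i) = \begin{cases} u(f_i) \sqrt{\dfrac{D(i)}{D_{\max}}} & \text{for } 1.375 < D(i) \le D_{\max} \\ u(f_i) \, \dfrac{D(i)}{\sqrt{1.375 \, D_{\max}}} & \text{for } D(i) \le 1.375 \end{cases} \tag{64} \]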


where u(f_i) is the relative sensitivity of our hearing to frequency difference, which depends on the
nature of the tone (Fig. 19). A good approximation to the relative hearing sensitivity to frequency
difference is

\[ u(f_i) = \begin{cases} 1 & \text{for } f_i < 1000 \text{ Hz} \\ \dfrac{-0.5}{3000} (f_i - 1000) + 1 & \text{for } 1000 \le f_i \le 4000 \text{ Hz} \end{cases} \]


[Figure: relative hearing sensitivity plotted against frequency (kHz)]


Fig. 19 — Relative hearing sensitivity to frequency differences. This figure
shows our hearing sensitivity to discriminating frequency difference as a function
of frequency. One curve is based on the JND of a single tone [20], and
the other curve is based on the JND of pitch-excited, speech-like sound with a
relatively flat spectral envelope in which one out of ten LSPs is perturbed (Fig.
16). We expect the relative hearing sensitivity curve of speech sounds to be
located somewhere between these two curves.
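
The distance measure of Eq. (60) with the weighting of Eq. (64) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, D_max is taken as the approximate 20 ms quoted in the text, the slope of u(f) above 1 kHz (a drop of 0.5 over 3000 Hz) is an assumed reading of the approximation, and evaluating u at the first vector's frequency is an arbitrary choice.

```python
import math

D_FLAT_MS = 1.375   # group delay for a flat spectrum (11/8000 s)
D_MAX_MS = 20.0     # approximate maximum group delay observed from speech

def hearing_sensitivity(f_hz, slope_per_hz=-0.5 / 3000.0):
    """u(f): 1 below 1 kHz, linear decline above (slope is an assumption)."""
    if f_hz < 1000.0:
        return 1.0
    return slope_per_hz * (f_hz - 1000.0) + 1.0

def weight(f_hz, d_ms, d_max=D_MAX_MS):
    """Weighting coefficient in the form of Eq. (64)."""
    if d_ms > D_FLAT_MS:
        spectral = math.sqrt(d_ms / d_max)               # original square-root curve
    else:
        spectral = d_ms / math.sqrt(D_FLAT_MS * d_max)   # ramp through the origin
    return hearing_sensitivity(f_hz) * spectral

def lsf_distance(fa, fb, da, db):
    """rms of weighted LSF differences, Eq. (60).

    fa, fb: LSFs in Hz; da, db: matching group delays in ms.
    The larger group delay of each pair sets the weight.
    """
    total = 0.0
    for f_a, f_b, d_a, d_b in zip(fa, fb, da, db):
        w = weight(f_a, max(d_a, d_b))
        total += (w * (f_a - f_b)) ** 2
    return math.sqrt(total / len(fa))
```

The two branches of `weight` meet at D = 1.375 ms, so the weighting stays continuous where the ramp joins the square-root curve.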


Template  Collection 

As  required  by  any  pattern-matching  or  recognition  process,  we  need  to  form  a  template  collection 
that  partitions  the  available  LSF  sets  into  a  set  of  clusters.  One  of  the  most  frequently  used  cluster- 
analysis  methods  is  the  c-mean  algorithm  [34].  This  method  is  executed  in  four  steps: 

Step 1 (Initialization): By some appropriate method (which is unspecified) partition the given set of
vectors Y into c clusters Ψ_j, j = 1, 2, ..., c, and compute the mean vectors m_j, j = 1, 2, ..., c.

Step 2 (Classification): Select a vector y in Y, and assign it to that cluster whose mean is closest to
y. In other words, assign y to Ψ_j if

\[ d(y, m_j) = \min_k d(y, m_k) \]

Step 3 (Updating): Update mean vector m_j, j = 1, 2, ..., c.

Step 4 (Termination): Go to Step 2 unless a complete scan of patterns in Y results in no change in
the cluster-mean vectors.

Upon termination of the algorithm, m_j for j = 1, 2, ..., c are the templates that are stored in memory.
The  method  described  above  has  been  used  extensively  for  the  past  20  years  [35]. 
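
The four steps can be sketched in Python as below. This is an illustration under stated assumptions: the round-robin initial partition, the Euclidean distance, and the `max_scans` guard are choices made for the sketch (the text leaves initialization unspecified), and the means are updated once per complete scan rather than after each individual assignment.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def c_means(vectors, c, dist=euclidean, max_scans=100):
    """Return (means, labels); assumes len(vectors) >= c."""
    # Step 1: initial partition (round-robin, an arbitrary choice) and means.
    labels = [k % c for k in range(len(vectors))]
    means = [mean_vector([v for v, l in zip(vectors, labels) if l == j])
             for j in range(c)]
    for _ in range(max_scans):
        # Step 2: assign every vector to the cluster with the closest mean.
        labels = [min(range(c), key=lambda j: dist(v, means[j])) for v in vectors]
        # Step 3: update the means (an empty cluster keeps its old mean).
        new_means = []
        for j in range(c):
            members = [v for v, l in zip(vectors, labels) if l == j]
            new_means.append(mean_vector(members) if members else means[j])
        # Step 4: stop when a complete scan leaves the means unchanged.
        if new_means == means:
            break
        means = new_means
    return means, labels

toy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
templates, assignment = c_means(toy, 2)
print(templates, assignment)
```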

An  alternative  approach  to  cluster  analysis  which  is  not  as  computationally  intensive,  as  described 
below,  allows  for  real-time  performance  with  the  vocoder.  An  advantage  for  using  the  vocoder 
hardware  is  that  the  templates  can  be  updated  while  two-way  conversation  is  in  progress.  Such  an 
arrangement  may  be  necessary  for  achieving  higher  intelligibility  at  low-bit  rates  in  a  multispeaker 
environment  or  at  a  noisy  speaker  site.  Another  advantage  of  the  real-time,  on-line  cluster  analysis  is 
that  it  employs  the  same  front-end  (including  the  microphone,  antialiasing  filter,  and  spectral  analysis) 
as  used  in  the  actual  communications.  (We  cannot  overemphasize  the  importance  of  using  a  matched 
front  end  for  the  training  data  collection  and  the  actual  speech  transmission.  Some  microphones  used 
in  the  military  are  far  from  ideal.  In  some  cases,  there  is  as  much  as  a  20  dB  difference  between  spec¬ 
tral peaks and spectral nulls within the passband.) One plausible algorithm for real-time, on-line clustering
analysis is the successive dichotomy of each vector from the training set into one of the following
two classes: it either belongs to an already-established cluster space, or it establishes a new cluster space with
itself as the cluster center. Clustering has the following steps:

Step 1: The first vector is treated as the first template, and it is stored in memory.

Step  2:  The  second  incoming  vector  is  compared  with  the  stored  template.  If  the  mutual  distance 
measured  by  the  chosen  distance  criterion  is  greater  than  a  preset  threshold  (ideally,  it  is  a  just- 
noticeable  distance)  then  the  second  vector  becomes  the  second  template.  Otherwise,  no  other  action 
is  taken. 

Step 3: The subsequent incoming vector is compared with every stored template. If the mutual
distance  between  the  incoming  vector  and  any  one  of  the  stored  templates  is  less  than  the  threshold,  no 
further  comparison  is  needed  because  it  found  a  cluster  to  which  it  belongs.  If  the  mutual  distance 
between  the  incoming  vector  and  every  stored  template  is  greater  than  the  threshold,  it  becomes  a  new 
template. 

Step 4: The operation indicated by Step 3 is repeated until the maximum allowable template size
is reached. (An exhaustive search of as many as 3840 templates is no longer a problem using state-of-
the-art signal processors.)
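
The on-line clustering steps above reduce to a short loop, sketched here in Python. The function name, the one-dimensional toy data, and the absolute-difference distance are hypothetical; a vector whose distance exactly equals the threshold is treated as belonging to the existing cluster, a case the steps leave open.

```python
def collect_templates(vectors, dist, threshold, max_templates):
    """Build a template list by successive dichotomy of incoming vectors."""
    templates = []
    for v in vectors:
        if len(templates) >= max_templates:
            break  # Step 4: the template memory is full
        # Steps 1 to 3: a vector becomes a new template only if it is farther
        # than the threshold from every stored template. all() stops at the
        # first template within the threshold, so no further comparison is
        # made once a cluster is found. The first vector always qualifies.
        if all(dist(v, t) > threshold for t in templates):
            templates.append(list(v))
    return templates

def abs_distance(a, b):
    """Toy distance for one-component vectors."""
    return abs(a[0] - b[0])

incoming = [[0.0], [0.5], [3.0], [3.2], [10.0]]
print(collect_templates(incoming, abs_distance, threshold=1.0, max_templates=3840))
```

Unlike the c-mean procedure, each training vector is visited once and the stored templates are never moved, which is what keeps the method cheap enough to run while a conversation is in progress.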

Both  clustering  algorithms  have  been  successfully  implemented  by  others  in  800-b/s  vocoders. 
We  prefer  the  latter  approach  for  clustering  because  we  are  working  toward  the  implementation  of  a 
lower  bit-rate  vocoder  (target  rate  of  300  b/s).  We  feel  some  form  of  automatic  template  updating  is 
essential  to  achieve  acceptable  speech  quality  at  this  low  rate. 

To  make  the  cluster  hyperspace  smaller  for  a  large  size  of  training  data,  we  experimented  with  two 
sets  of  templates;  one  for  male  voices  (i.e.,  pitch  frequency  of  200  Hz  or  less),  and  the  other  for  female 
voices (i.e., pitch frequency greater than 200 Hz). By using the two sets of templates, the male DRT
score remained virtually unchanged (87.0 as shown in the next section), but the DRT score for a
female voice improved by two points. We need more experimentation along this line before we can
justify doubling the template memory size.

DRT  Scores 

A low-bit-rate vocoder, such as our 800-b/s line-spectrum vocoder, does not reproduce speech
clearly  enough  for  general  use  such  as  the  telephone.  On  the  other  hand,  it  is  good  enough  for  tactical 
use  where  the  messages  are  comparatively  more  structured  than  everyday  conversations.  Some  four 
years  ago,  we  had  an  experimental  real-time  800-b/s  vocoder.  Although  it  had  a  poor  DRT  score  (only 
78.3 for three male speakers, "LL," "CH," and "RH"), we were able to communicate in a two-way
conversational  setup.  Now  the  DRT  score  for  the  line-spectrum  vocoder  operating  at  the  same  data 
rate  is  87.0  for  the  same  speakers  (Table  11). 


Table  11  —  Three  Male  DRT  Scores  of  the  800-b/s  Line- 
Spectrum  Vocoder.  This  table  lists  the  DRT  scores  for  three 
male speakers ("LL," "CH," and "RH") for our line-spectrum
vocoder operating at a fixed and synchronous data rate of 800
b/s. For comparison, the DRT score for a 2400-b/s LPC is also
shown. Both vocoders use identical sets of unquantized
reflection coefficients, voicing decisions, and speech rms values
as input data. At the time of writing this report (June 1984), a
DRT score of 87 is probably the highest attained by any
vocoder operating at the fixed data rate of 800 b/s. The LSF
templates do not contain any LSFs generated from these three
DRT  speakers. 


Sound Classification    800 b/s    2400 b/s    Differential

Voicing                  89.3       91.7         -2.4
Nasality                 89.8       93.7         -3.9
Sustention               79.2       79.7         -0.5
Sibilation               92.4       89.6         +2.8
Graveness                79.2       80.7         -1.5
Compactness              91.9       94.8         -2.9

Total                    87.0       88.4         -1.4

The  relatively  high  overall  DRT  score  for  our  line-spectrum  vocoder  is  a  result  of  the  relatively 
high  attribute  scores  for  "graveness"  and  "sustention."  A  higher  score  for  "graveness"  implies  that  the 
speech  spectral  envelope  is  well  characterized  by  our  filter-parameter  quantizer.  A  higher  score  for 
"sustention"  implies  that  the  speech  parameters  rise  quickly  at  abrupt  onsets.  We  believe  that  no  single 
factor  has  contributed  to  the  overall  enhancement  in  the  DRT  score,  but  rather  it  is  a  result  of  the 
accumulative  but  important  steps  we  have  made.  In  general,  vocoders  over  which  we  can  communicate 
easily have higher DRT scores according to communicability tests conducted at NRL [1]. Hence, it is
essential to realize a satisfactory DRT score prior to committing any prototype vocoder into production.

4800-b/s  Encoder/Decoder 

A  4800-b/s  voice  processor  is  needed  to  provide  communicators  with  improved  speech  quality 
when  compared  to  the  conventional  2400-b/s  LPC.  To  achieve  this  objective,  a  4800-b/s  voice  proces¬ 
sor must be free from the most serious limitations inherent in the 2400-b/s LPC; namely, the speech
waveform being classified into either a periodic or aperiodic waveform. This type of speech classification
for narrowband communication has been the tradition since the first vocoder was devised some 50
years ago. It requires both pitch tracking and voicing-state estimation (Fig. 2). Pitch tracking is difficult
to accomplish, and it tends to smooth the pitch contour. Natural speech has many pitch irregularities,
particularly at voiced onsets and vowel-to-vowel transitions. Without the natural-pitch irregularities,
speech tends to sound mechanical. Likewise, voicing-state estimation is even more difficult
than pitch tracking, particularly with breath noise and low-frequency, dominant-unvoiced
plosives such as /p/.

Recently,  however,  some  progress  has  been  made  toward  eliminating  grotesque  pitch  and  voicing 
errors.  The  use  of  high-speed  digital  signal  processing  has  made  possible  more  complex  arithmetic 
operations,  elaborate  logic  operations,  and  delay  decisions.  Yet,  the  2400-b/s  LPC  still  makes  occa¬ 
sional  pitch  and  voicing  errors.  For  example,  pitch  doubling  can  briefly  cause  a  male  voice  to  sound 
like  a  female  voice.  Voicing  error  can  cause  breath  noise  to  be  reproduced  as  a  snore  and  /p/  to 
approximate  the  sound  of  a  bilabial  fricative.  These  are  some  of  the  more  significant  reasons  why  the 
narrowband  LPC  has  not  been  universally  accepted  by  the  general  user.  Even  experienced  communica¬ 
tors  who  accept  distorted  CB  sounds  have  reservations  about  narrowband-LPC  sounds. 

Thus,  we  will  eliminate  the  use  of  both  pitch  and  voicing  in  our  4800-b/s  line-spectrum  vocoder 
as  we  have  done  previously  for  our  9600-b/s  multirate  processor  [36].  This  high-rate  device  has  been 
implemented  twice  for  real-time  operation,  and  it  has  been  extensively  tested.  In  terms  of  the  DRT,  it 
ranks on par with the 16,000-b/s CVSD [36]. In terms of communicability, it is much closer to a
32,000-b/s CVSD than the 2400-b/s LPC (Fig. 1). According to a recent speaker-recognition test, the
9600-b/s processor scored as high as a 64,000-b/s pulse-code modulator (PCM). We also note that the
communicators over this device indicated their satisfaction with regard to the ease and effort needed to
communicate. Thus, we are not misguided if we design the 4800-b/s line-spectrum vocoder on the same
principle as our previous 9600-b/s processor.

But  we  have  to  accomplish  a  two-to-one  data-rate  compression  in  order  to  reduce  9600  b/s  to 
4800  b/s.  A  partial  solution  comes  from  the  use  of  LSFs  as  filter  parameters  rather  than  reflection 
coefficients  which  were  used  for  the  9600-b/s  data  rate.  Another  partial  solution  also  comes  from  the 
use  of  a  more  coarsely  quantized  excitation  signal  than  that  used  in  the  9600-b/s  processor.  Table  12  is 
an  example  of  a  bit  allocation  for  the  4800-b/s  line-spectrum  vocoder.  These  figures  are  justified  in  the 
following  discussion. 

LSF Encoder/Decoder

Since  filter  parameters  are  updated  once  per  frame  (i.e.,  180  speech-sampling-time  intervals), 
relatively  few  bits  are  required  to  encode  them.  For  the  800-b/s  line-spectrum  vocoder,  only  12  bits 
are  used  as  discussed  earlier.  It  is  easy  to  show  that  the  use  of  a  few  extra  bits  for  the  filter  parameters 
could  improve  the  output  speech  quality  significantly.  On  the  other  hand,  the  excitation  signal  is  a 
sample-by-sample  parameter.  Thus,  for  nonpitch  excitation  many  bits  are  required  to  encode  them,  and 
we  have  allocated  as  many  as  85  bits  (Table  12)  for  the  4800-b/s  line-spectrum  vocoder.  It  is  also  easy 
to  show  that  using  a  few  extra  bits  for  the  excitation  signal  often  does  not  noticeably  enhance  the  out¬ 
put  speech  quality.  It  is  difficult  to  know  exactly  how  many  bits  should  be  allocated  for  filter  parame¬ 
ters  and  the  excitation  signal  because  the  output  speech  quality  is  influenced  by  both.  As  an  example, 
both  the  2400-b/s  and  9600-b/s  processors  illustrated  in  Fig.  1  use  identical  filter  parameters,  yet  per¬ 
formance  at  9600  b/s  is  far  superior  because  it  uses  more  bits  for  the  excitation  signal  [36].  Our  goal  is 
to  make  the  speech-synthesis  filter  as  good  as  that  of  the  2400-b/s  LPC.  The  2400-b/s  LPC  uses  41 
bits  to  encode  10  reflection  coefficients.  If  we  were  to  encode  10  LSFs  instead,  we  can  save  10  bits 
without hurting the DRT score (Table 5). These LSFs are quantized to have a 6% frequency resolution, and
the 10 LSFs are selected from the 42 frequencies listed in Table 13. We will use such a frequency
quantization rule for encoding the LSFs.

Table 12 — Bit Allocation for Filter Parameters and Excitation
Signal. Each parameter is renewed once per frame at a rate of
44.444 Hz. Note that the 2400-b/s LPC uses comparatively few
bits to encode the excitation-signal parameters. In contrast, the
higher rate devices use nearly four times more bits to encode
the excitation signal than the filter parameters.


Parameters            2400-b/s(a)    9600-b/s(b)    4800-b/s(c)
                        (bits)         (bits)         (bits)

Filter                    41                            21
Excitation Signal         12                            85
Sync                       1                             2

Total                     54            216            108

(a) Government-Standard 2400-b/s LPC.
(b) Navy Multirate Processor [36].
(c) Line-spectrum vocoder.
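
The totals in Table 12 follow from the frame rate, as this short Python check shows. It is a sketch under one assumption: an 8000-Hz sampling rate, which is implied by the 4-kHz upper cutoff and the 180-sample frame but is introduced here as a named constant.

```python
SAMPLE_RATE_HZ = 8000       # assumed sampling rate (4-kHz upper cutoff)
SAMPLES_PER_FRAME = 180     # one frame = 180 speech-sampling-time intervals

frame_rate_hz = SAMPLE_RATE_HZ / SAMPLES_PER_FRAME   # 44.444... frames per second

def bit_rate(bits_per_frame):
    """Data rate in b/s for a given number of bits per frame."""
    return round(bits_per_frame * frame_rate_hz)

for bits in (54, 216, 108):
    print(bits, "bits/frame ->", bit_rate(bits), "b/s")
```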


Table 13 — List of Frequencies with a 6% Frequency Resolution. All frequencies
above 400 Hz are quantized at 12 steps per octave (i.e., equitempered chromatic
scale), and rounded off to 10 Hz. The two frequencies below 400 Hz do not obey
this rule because they occur in murmurs, breath noise, etc. which are not critical
elements of normal speech.


Index  Freq.    Index  Freq.    Index  Freq.    Index  Freq.    Index  Freq.
       (Hz)            (Hz)            (Hz)            (Hz)            (Hz)

  1     300       3     400      15     800      27    1600      39    3200
  2     350       4     420      16     850      28    1700      40    3390
                  5     450      17     900      29    1800      41    3590
                  6     480      18     950      30    1900      42    3810
                  7     500      19    1010      31    2020
                  8     530      20    1070      32    2140
                  9     570      21    1130      33    2260
                 10     600      22    1200      34    2400
                 11     640      23    1270      35    2540
                 12     670      24    1350      36    2690
                 13     710      25    1430      37    2850
                 14     760      26    1510      38    3020
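
The quantization rule behind Table 13 can be checked with a short Python sketch. The table values are copied from Table 13; the helper names are hypothetical, and plain round-to-nearest is assumed for the "rounded off to 10 Hz" step.

```python
# Frequencies of Table 13, indices 1 through 42.
TABLE13_HZ = [
    300, 350,
    400, 420, 450, 480, 500, 530, 570, 600, 640, 670, 710, 760,
    800, 850, 900, 950, 1010, 1070, 1130, 1200, 1270, 1350, 1430, 1510,
    1600, 1700, 1800, 1900, 2020, 2140, 2260, 2400, 2540, 2690, 2850, 3020,
    3200, 3390, 3590, 3810,
]

def rule_frequency(index):
    """12 steps per octave starting from 400 Hz at index 3, rounded to 10 Hz."""
    exact = 400.0 * 2.0 ** ((index - 3) / 12.0)
    return int(round(exact / 10.0)) * 10

def nearest_index(f_hz):
    """Table 13 index (1-based) of the listed frequency closest to f_hz."""
    return min(range(1, 43), key=lambda k: abs(TABLE13_HZ[k - 1] - f_hz))

# Indices where round-to-nearest differs from the listed value.
mismatches = [k for k in range(3, 43) if rule_frequency(k) != TABLE13_HZ[k - 1]]
print(mismatches, nearest_index(1000.0))
```

Under this rounding assumption the rule reproduces every listed value except index 11, where the exact step of about 635 Hz is listed as 640 Hz rather than 630 Hz. One step of the scale is a ratio of 2^(1/12), about 1.059, which is the 6% resolution named in the table title.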

Although  we  can  encode  LSFs  independently,  we  chose  to  encode  center  and  offset  frequencies  of 
LSF pairs, as defined by Eqs. (56a) and (56b), because of the following advantages:

•  the  highest-offset  frequency  can  be  eliminated  from  encoding  because  it  is  least 
significant,  as  noted  from  Table  7; 

•  fewer  bits  can  represent  all  the  center  frequencies  because  their  distributions,  as  illus¬ 
trated  by  Fig.  11,  are  almost  nonoverlapping,  particularly  for  voiced  speech; 

•  the  offset  frequency  of  a  spectrally  sensitive,  closely  spaced  LSF  pair  is  well  preserved 
because  the  offset  frequency  is  independently  quantized  with  a  minimum  step  of  one  unit 
in  terms  of  frequency  codes; 

•  and transmission bit errors affect the frequency response of the synthesis filter in a
relatively small frequency range.

With some clamping of the upper and lower ranges, the five center and four offset frequencies of LSF pairs may
be encoded as indicated in Table 14. The total number of bits to encode filter parameters is only 21.


Table  14  —  Encoded  Filter  Parameters  and  Their  Ranges  for 
4800-b/s  Line-Spectrum  Vocoder.  The  total  number  of  bits 
used  is  21,  20  bits  less  than  that  used  for  the  2400-b/s  LPC, 
but  9  bits  more  than  that  used  for  the  800-b/s  line-spectrum 
vocoder. 


Filter Parameters          Frequency Index(a)                                        No. of Bits

Center Frequency     1     3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18        4
of LSF Pair          2     17, 18, 19, 20, 21, 22, 23, 24                                 3
                     3     25, 26, 27, 28, 29, 30, 31, 32                                 3
                     4     33, 34, 35, 36                                                 2
                     5     38, 39, 40, 41                                                 2

Offset Frequency     1     1, 2, 4, 6                                                     2
of LSF Pair          2     1, 2, 3, 4                                                     2
                     3     1, 2, 3, 4                                                     2
                     4     1, 2                                                           1
                     5     1 (fixed)                                                      0

Total                                                                                    21

(a) See Table 13.
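
The bit counts of Table 14 follow from the number of allowed codes per parameter, as this Python sketch shows. The dictionary and function names are hypothetical, and the clamping helper is only one plausible reading of "some clamping of the upper and lower ranges": an out-of-range Table 13 index is moved to the nearest allowed value before it is coded.

```python
# Allowed Table 13 indices for each center frequency: (lowest, highest).
CENTER_RANGES = {1: (3, 18), 2: (17, 24), 3: (25, 32), 4: (33, 36), 5: (38, 41)}
# Allowed offset values for each LSF pair, in frequency-code units.
OFFSET_CODES = {1: [1, 2, 4, 6], 2: [1, 2, 3, 4], 3: [1, 2, 3, 4], 4: [1, 2], 5: [1]}

def bits_for(levels):
    """Bits needed to index `levels` alternatives (0 bits for a fixed value)."""
    return (levels - 1).bit_length()

center_bits = [bits_for(hi - lo + 1) for lo, hi in CENTER_RANGES.values()]
offset_bits = [bits_for(len(codes)) for codes in OFFSET_CODES.values()]
total_bits = sum(center_bits) + sum(offset_bits)

def encode_center(pair, table_index):
    """Clamp a Table 13 index into the pair's range and return its code."""
    lo, hi = CENTER_RANGES[pair]
    return min(max(table_index, lo), hi) - lo

print(center_bits, offset_bits, total_bits)
```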


Excitation-Signal Encoder/Decoder

The  ideal  excitation  signal  for  the  LPC  analysis/synthesis  system  is  the  prediction  residual  because 
it  can  produce  output  speech  identical  to  the  input  speech  in  the  absence  of  quantization.  Virtually  all 
LPC-based,  high-rate  voice  processors  derive  their  excitation  signals  from  the  prediction  residual.  In 
principle,  these  residual  encoding  techniques  are  applicable  to  our  4800-b/s  line-spectrum  vocoder. 
Unfortunately,  we  have  only  85  bits  to  encode  the  prediction  residual,  whereas  our  9600-b/s  processor 
has  twice  as  many  bits.  Not  all  the  residual  encoding  techniques  are  usable  when  the  number  of  avail¬ 
able  bits  is  only  85.  Our  residual-encoding  technique  is  based  on  several  levels  of  scrutiny. 

First,  we  have  to  decide  whether  the  entire  residual  samples  should  be  transmitted  with  coarser 
quantization,  or  partial  residual  samples  (typically  those  occupying  below  1  kHz)  should  be  transmitted 
with  finer  quantization.  If  the  lowband-residual  samples  are  transmitted,  the  upper  residual  samples 
must  be  regenerated  at  the  receiver.  According  to  communicability  tests  conducted  at  NRL  [1],  the 
lowband-residual  excited  LPC  is  preferred  over  the  wideband-residual  excited  LPC  because  there  is  less 
audible-quantization  noise  at  the  output. 

Once  the  lowband-residual  excitation  approach  is  selected,  there  are  still  two  possible  ways  of 
encoding  residual  samples:  encoding  time  samples  or  spectral  components.  We  chose  the  latter 
approach  because  of  the  following  advantages:  (a)  low-pass  filtering  and  down  sampling  are  not 
required; (b) low-frequency components below 250 Hz, not essential to speech communications, can be
readily eliminated to save as much as 24 bits (i.e., 6 spectral components at 4 bits each); (c) bit-
tradeoff  between  speech  data  and  overhead  data  (sync  bits,  amplitude  normalization  factor,  etc.)  is 
more  flexible  because  a  reduction  of  one-speech-spectral  component  creates  a  small-data  package  of  4 
bits  to  encode  overhead  data;  and  (d)  upper  frequency  components  may  be  regenerated  by  simple  spec¬ 
tral  replication. 

One drawback of this spectral-encoding method is that we need a time-to-frequency transformation
of the prediction residual. To obtain spectral components of the prediction residual, we perform
the  following  operations.  The  12  trailing-residual  samples  from  the  preceding  frame  are  overlapped 
with  the  180  residual  samples  of  the  current  frame  to  reduce  waveform  discontinuity  at  the  frame 
boundary.  Then,  the  192  overlapped  samples  are  "trapezoidally  windowed"  with  linear-amplitude 
weighting over the 12 overlapped samples. The time-to-frequency transform is carried out by the use
of a 96-point fast Fourier transform (FFT). The use of a half-size FFT reduces computations because the
input-residual samples are real, and we need spectral components only below 1 kHz. The maximum amplitude spectral
component below 1 kHz is transmitted as the amplitude-normalization factor. It is quantized to one of
thirty-two 1.75-dB steps covering a dynamic range of 56 dB. Thus, 5 out of 85 bits are used as overhead
data. The remaining 80 bits are used for encoding 20 spectral components, the 7th through 26th components.
Since the frequency separation is (4000/96) = 41.667 Hz, the lowband-residual information
covers the frequency range from 250 Hz to 1041.67 Hz.
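
The figures in the paragraph above can be reproduced with a few lines of Python. This is only an arithmetic check; it assumes that the kth spectral component sits at (k - 1) times the frequency separation (so the first component is at 0 Hz), which is consistent with the quoted 250-Hz and 1041.67-Hz band edges. The variable names are hypothetical.

```python
BINS_UP_TO_4KHZ = 96
spacing_hz = 4000.0 / BINS_UP_TO_4KHZ            # 41.667 Hz between components

components = range(7, 27)                        # 7th through 26th components
freqs_hz = [(k - 1) * spacing_hz for k in components]

AMPLITUDE_STEPS = 32
STEP_DB = 1.75
dynamic_range_db = AMPLITUDE_STEPS * STEP_DB     # 56 dB
overhead_bits = (AMPLITUDE_STEPS - 1).bit_length()   # 5 bits for 32 steps

BITS_PER_COMPONENT = 4
excitation_bits = overhead_bits + BITS_PER_COMPONENT * len(components)

print(round(spacing_hz, 3), freqs_hz[0], round(freqs_hz[-1], 2))
print(dynamic_range_db, overhead_bits, excitation_bits)
```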

Each of these 20 spectral components may be encoded in terms of its real and imaginary parts, or
in terms of its amplitude and phase spectral components. We note that preservation of phase information
is vital to the synthesis of high-quality speech because it defines how each spectral component is
phased in reference to the LPC frame which is not pitch synchronous. Thus, encoding the amplitude
and phase components is preferred. Although they may be encoded independently, we chose to encode
them jointly because of the following advantages: (a) the number of amplitude steps can be traded with
the number of phase steps for improved speech quality; (b) an amplitude-dependent phase resolution is
feasible (i.e., if the amplitude-spectral component is -15 dB or less with respect to the amplitude-
normalization  factor,  then  the  corresponding  phase  component  may  be  quantized  more  coarsely  because 
we  cannot  hear  it  as  well  as  the  other  components);  and  (c)  we  can  have  more  diversified  phase  angles 
for  more  natural  sounding  speech. 

The  available  80  bits  are  equally  divided  for  encoding  the  20  complex-spectral  components  whose 
amplitudes  have  been  normalized  by  the  maximum  amplitude  spectral  component  (i.e.,  all  magnitudes 
are  less  than  or  equal  to  unity).  Thus,  encoding  each  spectral  component  with  4  bits  is  equivalent  to 
selecting  one  of  16  encoding  points  located  within  a  unit  circle.  These  spectral-encoding  points  are 
designed  from  the  probability-density  functions  of  both  the  residual  amplitude  and  phase  spectral  com¬ 
ponents. Since the LPC-analysis frame is not pitch synchronous, the residual-phase-spectral components
are random, with a probability-density function that is uniform between -π and π radians.
Thus, a uniform quantizer may be used for phase encoding. The phase resolution is amplitude-
dependent,  as  will  be  shown.  On  the  other  hand,  the  probability  density  function  of  the  residual- 
amplitude-spectral  components  is  bell-shaped  as  shown  in  Fig.  20.  Thus,  the  amplitude  quantizer  will 
have  unequal  step  sizes. 

For  a  4-bit  quantizer  of  a  complex-spectral  component,  2  bits  may  be  assigned  for  amplitude  reso¬ 
lution.  But,  according  to  our  experimentation,  a  four-level  amplitude  quantizer  does  not  leave  enough 
room  for  adequate  phase  resolution.  We  prefer  the  use  of  a  three-level  amplitude  quantizer.  The 
design  of  this  quantizer  is  based  on  the  following  amplitude  transfer  characteristics: 

y(x) = x_1/2  if  0 ≤ x < x_1,

Output Speech Evaluation

Ten years ago a 64,000-b/s nonpitch-excited vocoder was deployed in a limited quantity.
Although people found it easy to talk over, its DRT score was actually below that of a 2400-b/s vocoder
available at that time. We feel that intelligibility is the most important performance index for a
low-bit-rate voice processor because it is often deployed in tactical communications where conversations are
generally brief. In some cases, there may be no time to ask for a message to be repeated. Fortunately, the
total DRT score of our 4800-b/s line-spectrum vocoder falls between those of the 2400-b/s LPC and the
9600-b/s multirate processor, as expected (Table 15).

Table  15  —  Three  Male  DRT  Scores  of  4800-b/s  Line- 
Spectrum  Vocoder.  As  before,  the  three  male  speakers 
are  "LL,"  "CH,"  and  "RH."  For  comparison,  DRT  scores 
of  a  2400-b/s  LPC  and  the  9600-b/s  Navy  multirate 
processor are also listed.


Sound Classification    2400 b/s    4800 b/s    9600 b/s

Voicing                   91.7        94.8        96.3
Nasality                  93.7        97.4        99.2
Sustention                79.7        91.1        88.3
Sibilation                89.6        93.2        92.5
Graveness                 80.7        82.0        84.4
Compactness               94.8        95.1        97.4

Total                     88.4        92.3        93.0
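
The Total row of Table 15 appears to be the unweighted mean of the six attribute scores in each column. That is an observation drawn from the numbers themselves, not a statement made in the report. The short Python check below recomputes the column means; the dictionary layout and the name `column_means` are illustrative assumptions.

```python
# DRT scores copied from Table 15, in the order (2400 b/s, 4800 b/s, 9600 b/s).
drt_scores = {
    "Voicing":     (91.7, 94.8, 96.3),
    "Nasality":    (93.7, 97.4, 99.2),
    "Sustention":  (79.7, 91.1, 88.3),
    "Sibilation":  (89.6, 93.2, 92.5),
    "Graveness":   (80.7, 82.0, 84.4),
    "Compactness": (94.8, 95.1, 97.4),
}

def column_means(scores):
    """Return the per-column mean of the attribute scores, rounded to one decimal."""
    columns = list(zip(*scores.values()))
    return [round(sum(col) / len(col), 1) for col in columns]

totals = column_means(drt_scores)
```

The recomputed means agree with the Total row, which suggests that the six sound classifications are weighted equally in the total.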

One strength of a nonpitch-excited voice processor is that it performs much better with noisy
speech (when the input speech is noisy, the output speech is similarly noisy). In contrast, the 2400-b/s
pitch-excited LPC cannot reproduce such noisy speech because the pitch-excitation signal (i.e.,
pulse train) does not contain sample-by-sample noise. Figure 22 vividly illustrates the difference
between the 4800-b/s line-spectrum vocoder output and the 2400-b/s LPC output when the input
speech is noisy. The audible difference is even more striking than the visual contrast revealed in Fig. 22.

CONCLUSIONS 

Reflection coefficients have been the most often used filter parameters to represent the speech
synthesizer in an all-pole-filter configuration. This report presents equivalent filter parameters, called
line-spectrum frequencies, which are frequency-domain parameters. Thus, frequency-dependent
hearing sensitivities can be incorporated into the quantization process, so that whatever is not readily
discernible to the human ear is represented only crudely.

A benefit of using line-spectrum frequencies is that the same level of initial consonant
intelligibility is achieved by using 10 bits (approximately 25%) fewer than required by reflection coefficients.
Furthermore, speech degradation at additional bit savings is gradual. As a result, line-spectrum
frequencies may be used for implementing an 800-b/s voice processor which is capable of providing
speech intelligibility several points higher than other 800-b/s voice processors heretofore tested.
Likewise, line-spectrum frequencies may be used for implementing a 4800-b/s voice processor which is free
from the use of both the pitch and voicing decision (i.e., nonpitch-excited narrowband voice processor).
Both are welcome additions to narrowband voice processors: the 800-b/s line-spectrum vocoder for
transmitting speech over more constricted-information channels, and the 4800-b/s line-spectrum
vocoder for achieving a higher communicability than achievable with the 2400-b/s voice processor.
Fig. 22 — Spectrum of noisy input speech, output speech from the 4800-b/s line-spectrum vocoder, and
output speech from the 2400-b/s LPC. The input speech was taken from a commercial record in which a
newsman asks a question of then-President John F. Kennedy in a White House conference room.
The output from the 4800-b/s line-spectrum vocoder is a closer replica of the original than the output of
the 2400-b/s LPC. Note the absence of noise in the voiced segment of the 2400-b/s LPC. (a) Original
speech; (b) output speech from 4800-b/s line-spectrum vocoder; (c) output speech from 2400-b/s LPC.
Appendix

SUMMARY  OF  LPC  ANALYSIS  AND  SYNTHESIS 


The basic equations of LPC analysis and synthesis are summarized in this appendix. These
equations are well known in the speech processing field, but they may be helpful to those getting acquainted
with this area.

OVERVIEW  OF  LPC  ANALYSIS/SYNTHESIS 

The LPC analysis decomposes a given speech waveform into two component waveforms (Fig.
A1). One waveform is a set of slowly time-varying components (i.e., prediction coefficients or filter
coefficients) which represent the resonance characteristics of the vocal tract. The other waveform is a
wideband signal, called the prediction residual, which is the difference between the actual and the
predicted speech samples. The prediction residual is an ideal excitation signal for the speech
synthesizer because it produces a synthesis filter output nearly identical to the input speech. For a speech
transmission rate of 2400 b/s, the prediction residual is modeled as one of two rudimentary signals: a
pulse train (or repetitive broadband signal) for voiced sounds and random noise for unvoiced sounds.
In essence, the prediction residual is characterized by three excitation parameters: pitch period, voicing
decision, and amplitude information. The filter coefficients and excitation parameters are updated
periodically (every 22.5 milliseconds (ms)).
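
The two-state excitation model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_excitation` is hypothetical, the 180-sample frame length assumes 8-kHz sampling (22.5 ms x 8000 samples/s), which this section does not state, and Gaussian noise stands in for whatever noise source an actual implementation uses.

```python
import random

def make_excitation(voiced, pitch_period, amplitude, frame_len=180, seed=0):
    """Build one frame of excitation from the three excitation parameters.

    voiced       -- voicing decision (True for voiced, False for unvoiced)
    pitch_period -- spacing of the pulses in samples (used only when voiced)
    amplitude    -- scale factor applied to the pulses or to the noise
    frame_len    -- samples per frame; 180 assumes 8-kHz sampling (assumption)
    """
    if voiced:
        # Pulse train: one pulse every `pitch_period` samples, zeros elsewhere.
        return [amplitude if i % pitch_period == 0 else 0.0
                for i in range(frame_len)]
    # Unvoiced: random noise scaled by the amplitude parameter.
    rng = random.Random(seed)
    return [amplitude * rng.gauss(0.0, 1.0) for _ in range(frame_len)]

voiced_frame = make_excitation(True, pitch_period=60, amplitude=1.0)
unvoiced_frame = make_excitation(False, pitch_period=0, amplitude=0.5)
```

Note that the voiced frame contains no noise between pulses, which is the property the report cites when explaining why the 2400-b/s LPC cannot reproduce noisy input speech.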


Fig. A1 — Block diagram of 2400-b/s LPC: (a) transmitter; (b) receiver


BLOCK-FORM  LPC  ANALYSIS  AND  SYNTHESIS 

In linear predictive analysis, a speech sample is represented as a linear combination of past
samples. Thus,

$$x_i = \sum_{j=1}^{n} a_{j|n}\, x_{i-j} + e_i, \qquad i = 0, 1, 2, \ldots, m, \tag{A1}$$

where $a_{j|n}$ is the $j$th prediction coefficient of the $n$th order predictor, and $e_i$ is the $i$th prediction
residual.
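
A minimal Python sketch of Eq. (A1) follows. It computes the prediction residual by inverse filtering and then rebuilds the input by driving the all-pole synthesis filter with that residual. The function names, the test signal, and the second-order coefficient set are illustrative assumptions, and samples before the start of the block are taken to be zero.

```python
def lpc_residual(x, a):
    """Analysis: e[i] = x[i] - sum_{j=1..n} a[j-1] * x[i-j]  (Eq. A1 solved for e)."""
    n = len(a)
    e = []
    for i in range(len(x)):
        # Samples before the start of the block are treated as zero (assumption).
        predicted = sum(a[j - 1] * x[i - j] for j in range(1, n + 1) if i - j >= 0)
        e.append(x[i] - predicted)
    return e

def lpc_synthesize(e, a):
    """Synthesis: x[i] = sum_{j=1..n} a[j-1] * x[i-j] + e[i]  (Eq. A1 as written)."""
    n = len(a)
    x = []
    for i in range(len(e)):
        predicted = sum(a[j - 1] * x[i - j] for j in range(1, n + 1) if i - j >= 0)
        x.append(predicted + e[i])
    return x

# Arbitrary illustrative data, not taken from the report.
speech = [0.0, 1.0, 0.5, -0.25, -1.0, 0.3, 0.8, -0.6]
coeffs = [0.9, -0.2]
residual = lpc_residual(speech, coeffs)
rebuilt = lpc_synthesize(residual, coeffs)
```

Because synthesis inverts analysis exactly, `rebuilt` matches `speech` up to rounding. This is the sense in which the unquantized prediction residual is an ideal excitation signal.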
In terms of inner vector notation, these two sequences are related by

$$j = 1, 2, \ldots, n.$$

By the Gram-Schmidt orthogonalization process, $y_j$ in terms of $x_j$ for $j = 1, 2, \ldots, n$, are:

$$y_1 = x_1$$
this through (A10) provide an iterative solution for the reflection coefficients. To find
the prediction coefficients $a_{j|n}$ for $j = 1, 2, \ldots, n$, substitute Eqs. (A8) through (A10)
into Eq. (A7) for each value of $j$. The following expressions may be established:

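$$a_{1|1} = k_1,$$

$$\begin{bmatrix} a_{1|2} \\ a_{2|2} \end{bmatrix} =
\begin{bmatrix} 1 & -a_{1|1} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} k_1 \\ k_2 \end{bmatrix},$$

$$\begin{bmatrix} a_{1|3} \\ a_{2|3} \\ a_{3|3} \end{bmatrix} =
\begin{bmatrix} 1 & -a_{1|1} & -a_{2|2} \\ 0 & 1 & -a_{1|2} \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} k_1 \\ k_2 \\ k_3 \end{bmatrix},$$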

and  so  on.  The  above  expressions  can  be  put  into  one  compact  recursive  expression: 

$$a_{j|n+1} = a_{j|n} - k_{n+1}\, a_{n+1-j|n}, \qquad j = 1, 2, \ldots, n. \tag{A11}$$
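
The recursion can be sketched in Python as follows. The function name `reflection_to_prediction` and the numerical values are illustrative assumptions. The sketch also assumes that the highest-order coefficient equals the newest reflection coefficient, $a_{n+1|n+1} = k_{n+1}$, as the last row of each matrix expression above indicates.

```python
def reflection_to_prediction(k):
    """Convert reflection coefficients k_1..k_n to prediction coefficients.

    Returns [a_{1|n}, ..., a_{n|n}]. Applies, order by order,
        a_{j|n+1} = a_{j|n} - k_{n+1} * a_{n+1-j|n},  j = 1..n,
    and appends a_{n+1|n+1} = k_{n+1}.
    """
    a = []  # order-0 predictor has no coefficients
    for k_next in k:
        n = len(a)
        # a[j] holds a_{j+1|n}; its mirror term a_{n+1-(j+1)|n} is a[n - 1 - j].
        a = [a[j] - k_next * a[n - 1 - j] for j in range(n)] + [k_next]
    return a

# Arbitrary illustrative values, not taken from the report.
k = [0.5, -0.3, 0.2]
a3 = reflection_to_prediction(k)
```

For these values the result agrees with the third-order matrix expression: for example, $a_{1|3} = k_1 - a_{1|1} k_2 - a_{2|2} k_3 = 0.5 + 0.15 + 0.06 = 0.71$.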

The  transfer  function  of  the  nth  order  predictor,  denoted  by  A„(z),  is 

$$A_n(z) = -\sum_{j=0}^{n} a_{j|n}\, z^{-j}, \qquad a_{0|n} = -1. \tag{A12}$$

Substituting Eq. (A11) into Eq. (A12) gives $A_{n+1}(z)$ in terms of $A_n(z)$. Thus,

$$A_{n+1}(z) = A_n(z) - k_{n+1}\, z^{-(n+1)} A_n(z^{-1}) = A_n(z) - k_{n+1} B_n(z),$$

where

$$B_n(z) = z^{-(n+1)} A_n(z^{-1}).$$

Likewise, $B_{n+1}(z)$ may be expressed as

$$B_{n+1}(z) = z^{-1} \left[ B_n(z) - k_{n+1} A_n(z) \right].$$
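
The pair of recursions for $A_{n+1}(z)$ and $B_{n+1}(z)$ can be read as a lattice filter operating directly on the reflection coefficients. The Python sketch below is an illustration under stated assumptions, not the report's implementation: it starts from $A_0(z) = 1$ and $B_0(z) = z^{-1}$, treats samples before the block as zero, and compares the lattice output with a direct-form inverse filter built from the same reflection coefficients. All function names and data are hypothetical.

```python
def step_up(k):
    """Reflection -> prediction coefficients: a_{j|n+1} = a_{j|n} - k_{n+1} a_{n+1-j|n}."""
    a = []
    for k_next in k:
        n = len(a)
        a = [a[j] - k_next * a[n - 1 - j] for j in range(n)] + [k_next]
    return a

def direct_residual(x, a):
    """Direct-form inverse filter A_n(z): e[i] = x[i] - sum_j a_{j|n} x[i-j]."""
    return [x[i] - sum(a[j - 1] * x[i - j]
                       for j in range(1, len(a) + 1) if i - j >= 0)
            for i in range(len(x))]

def lattice_residual(x, k):
    """Lattice inverse filter: A_{n+1} = A_n - k B_n,  B_{n+1} = z^-1 (B_n - k A_n)."""
    f = list(x)                    # output of A_0(z) = 1
    b = [0.0] + list(x[:-1])       # output of B_0(z) = z^-1 (one-sample delay)
    for k_next in k:
        f_next = [fi - k_next * bi for fi, bi in zip(f, b)]
        u = [bi - k_next * fi for fi, bi in zip(f, b)]
        b = [0.0] + u[:-1]         # apply the z^-1 factor
        f = f_next
    return f

# Arbitrary illustrative data, not taken from the report.
x = [1.0, 0.5, -0.25, -1.0, 0.3, 0.8, -0.6, 0.1]
k = [0.5, -0.3, 0.2]
e_lattice = lattice_residual(x, k)
e_direct = direct_residual(x, step_up(k))
```

Both routes yield the same residual, which illustrates that the lattice form needs only the reflection coefficients and never forms the prediction coefficients explicitly.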
Equation (A11) converts a set of reflection coefficients to a set of prediction coefficients.
The prediction coefficients may be derived from a set of reflection coefficients by use of

$$a_{j|n+1} = a_{j|n} - k_{n+1}\, a_{n+1-j|n}, \qquad j = 1, 2, \ldots, n. \tag{A11}$$

Letting $j$ be replaced by $n + 1 - j$ in Eq. (A11) gives

$$a_{n+1-j|n+1} = a_{n+1-j|n} - k_{n+1}\, a_{j|n}.$$
y in.

altern^^iivdy  Thus,  solving  for  or 
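
The surviving text breaks off at this point. Treating Eq. (A11) and its index-reversed form as two equations in the two unknowns $a_{j|n}$ and $a_{n+1-j|n}$ gives the standard step-down relation $a_{j|n} = (a_{j|n+1} + k_{n+1}\, a_{n+1-j|n+1}) / (1 - k_{n+1}^2)$. This is offered as an assumption about where the derivation is headed, not as the report's own equation. The Python sketch below applies that relation to recover reflection coefficients from prediction coefficients; the function name and numerical values are illustrative.

```python
def prediction_to_reflection(a):
    """Recover k_1..k_n from [a_{1|n}, ..., a_{n|n}] by stepping the order down.

    At each order the top coefficient is taken as the reflection coefficient,
    k_{n+1} = a_{n+1|n+1}, and the lower-order set follows from
        a_{j|n} = (a_{j|n+1} + k_{n+1} * a_{n+1-j|n+1}) / (1 - k_{n+1}**2).
    """
    a = list(a)
    k = []
    while a:
        k_top = a[-1]
        k.append(k_top)
        n = len(a) - 1                      # order of the predictor we step down to
        denom = 1.0 - k_top * k_top
        if n > 0 and denom == 0.0:
            raise ValueError("reflection coefficient of magnitude 1; cannot step down")
        # a[j] holds a_{j+1|n+1}; its mirror term a_{n+1-(j+1)|n+1} is a[n - 1 - j].
        a = [(a[j] + k_top * a[n - 1 - j]) / denom for j in range(n)]
    k.reverse()                             # collected from highest order to lowest
    return k

# Prediction coefficients corresponding to k = (0.5, -0.3, 0.2); illustrative values.
k_back = prediction_to_reflection([0.71, -0.43, 0.2])
```

The division by $1 - k_{n+1}^2$ is well defined whenever every reflection coefficient has magnitude below one, which is also the usual stability condition for the all-pole synthesis filter.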